[Toybox] Toybox test image / fuzzing

Andy Chu andychup at gmail.com
Sun Mar 13 00:34:46 PST 2016


> Unfortunately, the test suite needs as much work as the command
> implementations do. :(
>
> Ok, backstory!

OK, thanks a lot for all the information!  That helps.  I will work on
this.  I think a good initial goal is just to triage the tests that
pass and make sure they don't regress (i.e. make it easy to run the
tests, keep them green, and perhaps have a simple buildbot).  For
example, the factor bug is trivial but it's a lot easier to fix if you
get feedback in an hour or so rather than a month later, when you have
to load it back into your head.
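
Concretely, I'd start with something as simple as this (a sketch
assuming the stock "make tests" target and the PASS:/FAIL: lines the
harness already prints; the log name is made up):

  make defconfig
  make tests 2>&1 | tee test.log
  echo "passed: $(grep -c 'PASS:' test.log), failed: $(grep -c 'FAIL:' test.log)"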

> Really, I need a tests/pending. :(

Yeah I have some ideas about this.  I will try them out and send a
patch.  I think there does need to be more than 2 categories as you
say, though, and perhaps more than one kind of categorization.

> Building scripts to test each individual input is what the test suite is
> all about. Figuring out what those inputs should _be_ (and the results
> to expect) is, alas, work.

Right, it is work that the fuzzing should be able to piggyback on...
so I was trying to find a way to leverage the existing test cases,
pretty much like this:

http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html

But the difference is that unlike sqlite, fuzzing toybox could do
arbitrarily bad things to your system, so it really needs to be
sandboxed.  It gives really nasty inputs -- I wouldn't be surprised if
it can crash the kernel too.
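
So I'd want to run it under a throwaway root as an unprivileged user,
roughly like this (a sketch using util-linux unshare; the scratch
paths are made up, and the toybox binary would have to be a static
build to run under the chroot):

  # populate a scratch root with the binary under test and a seed input
  mkdir -p /tmp/fuzz-root
  cp toybox /tmp/fuzz-root/toybox        # static build assumed
  echo 's/a/b/' > /tmp/fuzz-root/input.sed

  # new user/mount/pid/net namespaces, then chroot into the scratch root
  unshare --map-root-user --mount --pid --fork --net \
      chroot /tmp/fuzz-root /toybox sed -f /input.sed /input.sed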

Parsers in C are definitely the most likely successful targets for a
fuzzer, and sed seems like the most complex parser in toybox so far.
The regex parsing seems to be handled by libraries, and I don't think
those are instrumented (because they are in a shared library not
compiled with afl-gcc).  I'm sure we can find a few more bugs though.
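
Roughly what I have in mind (afl-gcc/afl-fuzz usage here is standard;
the standalone "make sed" build and overriding CC this way are
assumptions about the toybox build):

  # build a standalone, afl-instrumented sed
  make clean
  CC=afl-gcc make sed

  # seed the fuzzer with a trivial sed script and let it mutate away;
  # @@ is where afl-fuzz substitutes the generated input file
  mkdir -p seeds findings
  echo 's/a/b/' > seeds/basic.sed
  afl-fuzz -i seeds -o findings -- ./sed -f @@ /etc/hosts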

> There's also the fact that either the correct output or the input to use
> is non-obvious. It's really easy for me to test things like grep by
> going "grep -r xopen toys/pending". There's a lot of data for it to bite
> on, and I can test ubuntu's version vs mine trivially and see where they
> diverge.

Yeah there are definitely a lot of inputs besides the argv values, like
the file system state and kernel state.  Those are harder to test, but
I like that you are testing with Aboriginal Linux and LFS.  That is
already a great torture test.

FWIW I think the test harness is missing a few concepts:

- exit code
- stderr
- file system state -- the current method of putting setup at the
beginning of foo.test *might* be good enough for some commands, but
probably not all
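
Just to sketch what I mean for the first two, a check that compares
exit code and stderr as well as stdout could look something like this
(a hypothetical helper, not the existing testing() function from the
harness):

  # testfull name command stdout stderr exitcode stdin
  testfull()
  {
    OUT="$(printf '%s' "$6" | eval "$2" 2>stderr.txt)"
    CODE=$?
    ERR="$(cat stderr.txt)"; rm -f stderr.txt
    if [ "$OUT" = "$3" ] && [ "$ERR" = "$4" ] && [ "$CODE" -eq "$5" ]
    then echo "PASS: $1"
    else echo "FAIL: $1"; return 1
    fi
  }

  # e.g. expect empty stdout/stderr and exit status 1:
  testfull "false exit code" "false" "" "" 1 ""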

But this doesn't need to be addressed initially.

By the way, is there a target language/style for shell and make?  It
looks like POSIX shell, and I'm not sure about the Makefile -- is it
just GNU make or something more restrictive?  I like how you put most
stuff in scripts/make.sh -- that's also how I like to do it.

What about C?  Clang is flagging a lot of warnings that GCC doesn't,
mainly -Wuninitialized.
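
(The typical pattern it catches is something like this -- a made-up
illustration, not code from toybox:)

  // clang warns with -Wuninitialized; gcc tends to stay quiet here
  // unless it's optimizing, since its analysis depends on -O.
  int parse_flag(char *s)
  {
    int val;                      // never given a default

    if (s && *s == 'x') val = 1;

    return val;                   // uninitialized if the branch was skipped
  }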

> But putting that in the test suite, I need to come up with a set of test
> files (the source changes each commit, source changes shouldn't cause
> test case regressions). I've done a start of tests/files with some utf8
> code in there, but it hasn't got nearly enough complexity yet, and
> there's "standard test load that doesn't change" vs "I thought of a new
> utf8 torture test and added it, but that broke the ls -lR test."

Some code coverage stats might help?  I can probably set that up,
since it's similar to making an ASAN build.  (Perhaps HTML reports
built from clang's coverage mapping:
http://llvm.org/docs/CoverageMappingFormat.html)

The build patch I sent yesterday will help with that as well since you
need to set CFLAGS.
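
The workflow would be roughly this (a sketch using clang's standard
source-based coverage flags; whether CFLAGS actually reaches the
compiler this way depends on that patch):

  make clean
  CC=clang CFLAGS="-fprofile-instr-generate -fcoverage-mapping" make toybox

  # run something (ideally the whole test suite) to collect counters
  LLVM_PROFILE_FILE=toybox.profraw ./toybox sed -n 'p' /etc/hosts

  # turn the raw counters into a per-line report
  llvm-profdata merge -o toybox.profdata toybox.profraw
  llvm-cov show ./toybox -instr-profile=toybox.profdata > coverage.txt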


> Or with testing "top", the output is based on the current system load.
> Even in a controlled environment, it's butterfly effects all the way
> down. I can look at the source files under /proc I calculated the values
> from, but A) hugely complex, B) giant race condition, C) is implementing
> two parallel code paths that do the same thing a valid test? If I'm
> calculating the wrong value because I didn't understand what that field
> should mean, my test would also be wrong...
>
> In theory testing "ps" is easier, but in theory "ps" with no arguments
> is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
> pid of the "ps" binary changes, and the "TIME" of the shell might tick
> over to the next second. You can't "head -n 2" that it because it's
> sorted by pid, which wraps, so if your ps pid is lower than your bash
> pid it would come first. Oh, and there's no guarantee the shell you're
> running is "bash" unless you're in a controlled environment... That's
> just testing the output with no arguments.)

Those are definitely hard ones... I agree with the strategy of
classifying the tests, and then we can see how many of the hard cases
there are.  I think detecting trivial breakages will be an easy first
step, and it should allow others to contribute more easily.

thanks,
Andy
