[Toybox] Toybox test image / fuzzing

Rob Landley rob at landley.net
Sun Mar 13 14:56:52 PDT 2016



On 03/13/2016 02:52 PM, Andy Chu wrote:
> On Sun, Mar 13, 2016 at 11:55 AM, Rob Landley <rob at landley.net> wrote:
>> On 03/13/2016 01:06 PM, enh wrote:
>>> #include <cwhyyoushouldbedoingunittestinginstead>
>>>
>>> only having integration tests is why it's so hard to test toybox ps
>>> and why it's going to be hard to fuzz the code: we're missing the
>>> boundaries that let us test individual pieces. it's one of the major
>>> problems with the toybox design/coding style. sure, it's something all
>>> the existing competition in this space gets wrong too, but it's the
>>> most obvious argument for the creation of the _next_ generation
>>> tool...
>>
>> I started adding test_blah commands to the toys/example directory. I
>> plan to move the central ps plumbing to lib/proc.c and untangle the 5
>> commands in there into separate files, we can add test_proc commands if
>> you can think of good individual pieces to test.
>>
>> I'm open to this category of test, and have the start of a mechanism.
>> I'm just spread a bit thin, and it's possible I don't understand the
>> kind of test harness you want?
> 
> The toys/example/test_*.c files seem to print to stdout, so I guess
> they still need a shell wrapper to test correctness.

  cat tests/test_human_readable.test

  scripts/test.sh test_human_readable

(I didn't hook up the scripts/examples directory in the script that
makes the "make test_blah" targets. I should add that, although "make
test_test_human_readable" is an awkward name...)

> That's technically still an integration test rather than a unit test --
> roughly I would say integration tests involve more than one process
> (e.g. for a system of servers) whereas unit tests are run entirely
> within the language using a unit test framework in that language.

/me goes to look up the definitions of integration test and unit test...

https://en.wikipedia.org/wiki/Unit_testing
https://en.wikipedia.org/wiki/Integration_testing

And the second of those links to "validation testing" which redirects to
https://en.wikipedia.org/wiki/Software_verification_and_validation which
implies that testing (like documentation) is something done badly by a
third party team in Bangalore after the original team scatters to the
four winds, so no.

I do "unit testing" while developing, but then I repeatedly refactor
that code as I go. I just split xexit() into _xexit() and xexit() while
redoing sigatexit() to replace atexit(). (Because the toys.rebound
longjmp stuff needs to happen after "atexit" but the standard C
functions don't give you a way of triggering the list early, nor of
removing things from it short of actually exiting.)

If I had a unit test suite for xexit(), I would have made more work for
myself updating it. I'm still trying to get toys/code.html to have
decent coverage of lib, so that other people can use these tools. A test
suite that tests things that have no external visibility in a running
program proves what exactly?

My test suite is _deeply_ unfinished, and testing a moving target, but
its eventual goals are:

1) Regression testing.
2) Standards compliance.
3) Coverage of all code paths.

#3 is non-obvious: how does signal delivery work in here, or disk full?
If sed -i receives a kill signal while saving, it should leave the old
file in place, which means writing a new .file.tmp and then mv-ing it
atomically over the old one. But kill -9 means that when re-run it needs
to clean up .file.tmp, and sed -i doesn't get re-run a lot (the way vi
would), which means what we WANT to do is open our tempfile, delete it,
write to it, and then hardlink the /proc/$$/fd/fileno into the new
location, taking advantage of proc's special case behavior. But is that
portable enough (sed should work if /proc isn't mounted)? And other
things may want to use that, so that code should live in lib/ and have a
fallback path with atexit() stuff (see lib/lib.c copy_tempfile())...
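
The trick would look something like this rough sketch (not toybox code:
it uses Linux's O_TMPFILE for the "open a tempfile and delete it" part,
assumes /proc is mounted and that path lives on the current directory's
filesystem, and the names are made up; real code would keep the
copy_tempfile()-style fallback):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <limits.h>
  #include <stdio.h>
  #include <unistd.h>

  // Write len bytes and atomically give the result the name "path".
  // The data never has a visible name until the end, so kill -9 can't
  // strand a partial .file.tmp for the next run to clean up.
  static int save_via_proc(char *path, char *data, size_t len)
  {
    char fdpath[32], tmp[PATH_MAX];
    int fd = open(".", O_TMPFILE|O_WRONLY, 0644);  // anonymous file

    if (fd == -1) return 1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd)) {
      close(fd);
      return 1;
    }

    // linkat() via /proc gives the anonymous file a name, but won't
    // replace an existing path, so link to a temporary name and then
    // rename() over the old file (rename is the atomic step: readers
    // see either the old contents or the new, never half a file).
    sprintf(fdpath, "/proc/self/fd/%d", fd);
    snprintf(tmp, sizeof(tmp), "%s.new", path);
    if (linkat(AT_FDCWD, fdpath, AT_FDCWD, tmp, AT_SYMLINK_FOLLOW)
        || rename(tmp, path)) {
      unlink(tmp);
      close(fd);
      return 1;
    }
    close(fd);

    return 0;
  }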

So if we _do_ make this plumbing, do we test it in sed or do we have a
test_copy_tempfile in toys/example that specifically tests this part of
the plumbing, and then it's just a question of whether sed uses it? But
if sed _didn't_ use it, we wouldn't notice unless we tested it...

Another coverage vs duplication issue is the fact that every command
should be calling xexit() at the end (including returning from main),
which means it does a fflush(0), checks ferror(), and does a
perror_exit() if there was a problem. (Which is why I've been pruning
back use of xprintf() and similar: those cause the program to exit early
rather than producing endless output when writing to a full disk or
closed socket, but the fflush() affects performance, and the exit path
should notice anyway.)
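
For reference, the shape of that exit path is roughly this (a sketch,
not the real xexit(), which also has the toys.rebound longjmp handling):

  #include <stdio.h>
  #include <stdlib.h>

  // One flush-and-check at the very end turns a full disk or closed
  // socket into a nonzero exit status, without paying for a flush after
  // every single printf() along the way.
  void xexit_sketch(int status)
  {
    if (fflush(0) || ferror(stdout)) {
      perror("write");
      if (!status) status = 1;
    }
    exit(status);
  }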

Possibly what I need is a shell function a test can call that says "this
command line modifies/replaces file $BLAH, make sure it handles disk
full and being interrupted and so on sanely", and have it run the
command in its own directory and make sure there are no leftover files
if it gets a kill signal while running. (Have it read from a fifo, and
once we're sure it's blocked send it a non -9 kill signal and then read
the directory to make sure there's only one file in there...)

This is the kind of thing I'm worried about in future. My idea of "full
coverage" is full of that sort of thing. Things which are externally
visible from the command can be tested by running the command in the
right environment.

> Google uses gunit/googletest for testing, and I guess Android does too:
> 
> https://github.com/google/googletest
> 
> Example: https://android.googlesource.com/platform/system/core.git/+/master/libziparchive/zip_archive_test.cc
> 
> You basically write a bunch of functions wrapped in TEST_ macros and
> they are linked into a binary with a harness and run.

This is a set of command line utilities, not a C library.

If, after the 1.0 release, somebody wants to make a C library out of it,
have fun. But until then, infrastructure bits are subject to change
without notice. (I'm currently banging on dirtree to try to get that
infinite depth thing rm wants, for example. Commit 8d95074b7d03 changed
some of the semantics, adjusted the callers, and updated the
documentation. What would altering a test suite at that level
accomplish? Either the behavior is visible to the outside world when the
command runs, or it isn't. I can make a test_dirtree wrapper to check
specific dirtree corner cases, but we should also have _users_ of all
those corner cases, and should be testing the visible behavior of those
users...)

> I guess toybox technically could use it if the tests were in C++ but
> the code is in C, though it seems like it clashes with the style of
> the project pretty badly.

The toybox shared C infrastructure isn't exported to the outside world
for use outside of toybox. If its semantics change, we adjust the users
in-tree.

Instrumenting the build to show that, in allyesconfig, a given function
is never used from anywhere is interesting (and can probably be done
with readelf and sed).

> I think the main issue that Elliott is pointing to is that there are no
> internal interfaces to test against or mock out so you don't hose your
> system while running tests (i.e. you can "reify" the file system state
> and kernel state, and then substitute them with fake values in tests).

I've always planned to test these commands under an emulator in a
virtual system. (Aboriginal Linux is a much older project than toybox.)

Heck, back under busybox I was using User Mode Linux as my emulator
(qemu wasn't available yet):

https://git.busybox.net/busybox/tree/testsuite/umlwrapper.sh?h=1_2_1&id=f86a5ba510ef

Before that, I added a chroot mode to busybox tests:

https://git.busybox.net/busybox/tree/testsuite/testing.sh?h=1_2_1&id=f86a5ba510ef#n95

I am aware of that problem, but rather than dissecting the code and
sticking pins in it, I prefer to run the tests under an emulator in an
environment it can trash without repercussions. I just haven't finished
implementing it yet because it doesn't solve the "butterfly effect"
tests. (It's on the todo list!)

Note: solving the butterfly effect tests _is_ possible by providing a
fake /proc instead of a real one, --bind mounting a directory of known
data over /proc for the duration of the test so it produces consistent
results. It's all solvable, it's just a can of worms I haven't opened
yet because I've got six cans of worms going in parallel already.

> I agree it would be nicer if there were such interfaces, but it's
> fairly big surgery, and somewhat annoying to do in C.  I think you
> would have to get rid of most global vars,

I mostly have. All command-specific global variables should go in
GLOBALS() (which means they go in "this", which is a union of structs),
and everything else should be in the global "toys" union except for
toybuf and libbuf.
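
In a command's source, that looks roughly like this fragment (made-up
field names, not a real command):

  #define FOR_example  // selects this command's slice of the "this" union
  #include "toys.h"

  GLOBALS(
    char *prefix;      // option parsing can fill these in directly...
    long count;

    int state;         // ...and private state that isn't an option goes
                       // here too, instead of a file-scope variable
  )

  void example_main(void)
  {
    TT.state++;
    printf("%s %ld %d\n", TT.prefix, TT.count, TT.state);
  }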

Let's see

  nm --size-sort toybox_unstripped | sort -k2,2

and ignoring the "r", "t", and "T" entries gives us:

0000000000000001 b completed.6973

grep -r is not finding "completed" as a variable name? Odd...

0000000000000008 b tempfile2zap

In lib/lib.c so copy_tempfile() can let tempfile_handler() know what
file to delete atexit(). Bit of a hack, largely because there's only
_one_ file at a time it can store (not a list, but no users need it to
be a list yet). I think code.html mentions this? (If not it should.)

0000000000000028 b chattr

Blah, that's garbage I missed when cleaning up this contribution. That
should go in GLOBALS(), you can have a union of structs in there to have
per-command variables when sharing a file. (But why do they share a
file? I'd have to dig...)

0000000000000004 B __daylight@@GLIBC_2.2.5
0000000000000008 B __environ@@GLIBC_2.2.5
0000000000000008 B stderr@@GLIBC_2.2.5
0000000000000008 B stdin@@GLIBC_2.2.5
0000000000000008 B stdout@@GLIBC_2.2.5
0000000000000008 B __timezone@@GLIBC_2.2.5

glibc vomited forth these for no apparent reason.

0000000000000048 B toys
0000000000001000 B libbuf
0000000000001000 B toybuf
0000000000002028 B this

The ones I mentioned above, these are _expected_.

0000000000000150 d e2attrs

More lsattr stuff. You'll note that 2013 was before I had "pending",
looks like I missed some cleanup in this command.

0000000000001600 D toy_list

That could probably be "r" with a little more work, although I vaguely
recall adding it made lots of spurious "a const was passed to a
non-const thing! Alas and alack, woe betide! Did you know that string
constants are in the read-only section and segfault if you try to write
to them but the compiler doesn't complain if you pass them to a
non-const argument yet it all works out fine? Oh doom and gloom!"

That's probably why I didn't.

0000000000000004 V daylight@@GLIBC_2.2.5
0000000000000008 V environ@@GLIBC_2.2.5
0000000000000008 V timezone@@GLIBC_2.2.5

glibc again.

000000000000001f W mknod

What is a "W" type symbol?

And of course there are buckets of violations needing to be fixed in
pending; this is just defconfig...

> and use a strategy like Lua
> or sqlite, where they pass around a context struct everywhere, which
> can have system functions like open()/read()/write()/malloc()/etc.

A) On nommu systems you have a limited stack size.

B) I looked into rewriting this in Lua back around 2009 or so. I chose
to stick with C. If you'd like to write a version in Lua, feel free.

If you're proposing that I extensively reengineer the project so you can
use a different style of test architecture, could you please explain
what those tests could test that the way I'm doing it couldn't?

> sqlite has a virtual file system (VFS) abstraction for extensive tests

Linux has --bind and union mounts, containers, and I can run the entire
system under QEMU.

> and Lua lets you plug in malloc/free at least.  They are libraries and
> not programs so I guess that is more natural.

The first time I wrote my own malloc/free wrapper to intercept and track
all allocations in a program was under OS/2 in 1996. I expect all
programmers do that at some point. I can only think of one person who
lists such a wrapper as one of his major life accomplishments on his
resume, and I have longstanding disagreements with that man.

I've been thinking for a long time about making generic recovery
infrastructure so you can nofork() any command and clean up after it.
And when I say "a long time" I mean a decade now:

http://lists.busybox.net/pipermail/busybox/2006-March/053270.html

And the conclusion I came to is "let the OS do it". I'm sure I blogged
about this in like 2011, but if you look at "man execve" and scroll down
to "All process attributes are preserved during an execve(), except the
following:" there's a GIANT list of things we'd need to clean up, and
that's just a _start_. It could mess with environment variables, or mess
with umask... it's just not worth it.

So I've since wandered to "fork a child process, have the child recurse
into the other_main() function to avoid an exec if there's enough stack
space left, and then have it exit with the parent waiting for it" as the
standard way of dealing with stuff that's not already easy. A command
can refrain from altering the process's state and be marked
TOYBOX_NOFORK or it can be a child process the OS cleans up after. I'm
not going for a case in between.
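
A generic sketch of that pattern (made-up names, MMU assumed, not the
actual toybox plumbing):

  #include <sys/wait.h>
  #include <unistd.h>

  // Run other_main() in a throwaway child so the OS reclaims whatever
  // state it trashes; the parent just collects the exit status.
  static int run_in_child(void (*other_main)(void))
  {
    int status;
    pid_t pid = fork();

    if (pid < 0) return -1;
    if (!pid) {          // child: do the work, then exit, never return
      other_main();
      _exit(0);
    }
    waitpid(pid, &status, 0);

    return status;
  }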

That said, it's important for long-lived processes not to leak. Init or
httpd can't leak, grep and sed can't leak per-file or per-line because
they can have unbounded input size... But again, people are looking for
that with valgrind, and I can make a general test for memory and open
filehandles and such in xexit() under TOYBOX_DEBUG.
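
Something like this sketch could handle the filehandle half at exit time
(hypothetical, assumes /proc is mounted, not current code):

  #include <dirent.h>
  #include <stdio.h>
  #include <stdlib.h>

  // Warn about any filehandle above stderr that's still open at exit.
  static void debug_fd_check(void)
  {
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *dd;

    if (!dir) return;
    while ((dd = readdir(dir))) {
      int fd = atoi(dd->d_name);

      // skip ".", "..", stdin/stdout/stderr, and opendir()'s own fd
      if (fd > 2 && fd != dirfd(dir))
        fprintf(stderr, "leaked fd %d\n", fd);
    }
    closedir(dir);
  }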

> I think this is worth keeping in mind perhaps, but it seems like there
> is a lot of other low hanging fruit to address beforehand.

If we need to test C functions in ways that aren't easily allowed by the
users of those C functions, we can write a toys/example command that
calls those C functions in the way we want to check. But if the behavior
isn't already accessible from an existing user of that function in one
of the commands, why do we care again?
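
For what it's worth, such a command is cheap to add: it's just another
toys/example file. A hypothetical sketch (same layout as the existing
test_* commands, made-up name, nothing here is real code):

  /* test_copy_tempfile.c - poke the lib tempfile plumbing directly
   *
   * Hypothetical sketch, not a real toybox command.

  USE_TEST_COPY_TEMPFILE(NEWTOY(test_copy_tempfile, "<1", TOYFLAG_USR|TOYFLAG_BIN))

  config TEST_COPY_TEMPFILE
    bool "test_copy_tempfile"
    default n
    help
      usage: test_copy_tempfile FILE

      Exercise the tempfile plumbing against FILE so the test script can
      check the externally visible result.
  */

  #define FOR_test_copy_tempfile
  #include "toys.h"

  void test_copy_tempfile_main(void)
  {
    // call the lib function under test on toys.optargs[0] here, then
    // print whatever the wrapper script needs to verify
    printf("%s\n", *toys.optargs);
  }

The matching tests/*.test file then checks its output the same way the
other command tests do.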

> Andy

Rob
