[Toybox] And back.

Sat Apr 9 15:58:31 PDT 2016

On Sat, Apr 9, 2016 at 2:52 PM, Rob Landley <rob at landley.net> wrote:
> I spent last week traveling to present at two conferences:
>
>   http://flourishconf.com/2016/speakers.php?id=18
>
> http://openiotelc2016.sched.org/event/6D9v/building-a-cpu-from-scratch-jcore-design-walkthrough-rob-landley-jeff-dionne-se-instruments
>
> Hopefully the videos will be up before too long. The slides are
> http://landley.net/talks/flourish-2016.txt and
> http://j-core.org/ELC-2016.pdf

Welcome back, and thanks for the links!  I'm always looking for things
to put on my queue of programming videos.  I have to admit some
ignorance about open hardware and nommu but it sounds interesting.

> And several days before that reinstalling my laptop (well, before
> reinstalling I tried to upgrade ubuntu 12.04 to 14.04 _without_ a
> reinstall, that video is at https://www.youtube.com/watch?v=26RTlPgg-tA ).

FWIW I think it's always wise to leave yourself a path to back out,
rather than doing destructive updates (which I gather was an issue).
I have two interchangeable OS partitions, and $HOME lives on a third
data partition.  So if the new OS has issues I can always go back the
old one by rebooting.

> So if you have a pending todo item that you think I should be working
> on, giving me a poke to remind me where I was wouldn't go amiss.

I think we have a few open threads:

(1) expr has memory safety issues on master.  I posted a message
demonstrating this with the ASAN patches:

http://lists.landley.net/pipermail/toybox-landley.net/2016-March/004884.html

A lot of this has fallen out of my cache, but I think at this point it
makes sense to just follow the original strategy and be done with it:
http://lists.landley.net/pipermail/toybox-landley.net/2016-March/004827.html

This patch is easy to inspect as correct... (and passes ASAN tests,
etc.)  The other ones require a lot of code reading and are not
correct.  The common case at runtime I suspect is 0, 1, or MAYBE 2
mallocs, so it basically isn't going to matter AFAICT ... if there is
a fast way to grep for expr on say Aboriginal packages that could help
answer this.

(2) I sent patches for a couple buffer overflows found by
AddressSanitizer.  These are quite obvious and should be easy to
review.

http://lists.landley.net/pipermail/toybox-landley.net/2016-March/004852.html
http://lists.landley.net/pipermail/toybox-landley.net/2016-March/004853.html

Given that the test coverage is fairly low (e.g. there was an
IRC-reported bug in 'paste', but no tests for paste), I think if we
made a wide pass and added very basic "hello world" tests for every
command, I'm sure we'd find dozens of real bugs.

The nice thing about ASAN, etc. is that it increases the value of any
tests you write.  For every code path you hit, you get assertions for
free.  I'm sure that building Aboriginal with toybox_asan, etc. would
shake out tons of issues.

(3) The big test harness patches.  I sent out patches to run tests
with 3 sanitizers: AddressSanitizer, Memory Sanitizer,
UndefinedBehavior Sanitizer.  They are based on runtime instrumention,
so there are pretty much no false positives (except maybe MSAN because
I think you need to build libc with it too).  Nonetheless they are
finding real bugs.

In addition, I got code coverage working a few weeks ago, which I
think you expressed interest in.  I didn't sent out a patch for it.
This a 4th form of runtime instrumentation -- as I recall it requires
a change to toybox source, because toybox calls _exit() rather than
exit().  The runtime instrumentation hooks exit() to write the
coverage data file.

I'm wondering what the general opinion is on these... I think the expr
example shows that they are quite useful.  I saw that you started
applying some of the fixes... although I would caution that there is
actually some substantial knowledge/testing embodied in those patches
-- I wrote them twice as mentioned, once in make and once in shell,
and it definitely got better the second time.

(4) I didn't respond to your last e-mails yet... we were talking about
shell and make.  So I actually have been cranking on my shell
implementation in the last 1-2 weeks :)  I prefer having something
concrete to talk about.

This is something I've wanted to do for a long time.  I had been
experimenting with Python, Lua, OCaml, Lisp, etc. over the years.  And
I guess toybox was to some extent an evaluation for writing it in C.

But it ended up in C++, which is surprising to me (and probably to
some people who have heard me complain about C++).  But it's turning
out really well -- one thing that enabled it is the re2c code
generator, which is huge.  It takes regexes and generates compact
state machines with no dependencies.  It's modular and doesn't force
your code into weird patterns (very much the opposite of lex/flex).

For the parser, I took the POSIX shell grammar and ported it to ANTLR
(but I'm NOT using ANTLR for any code generation, because it's
ridiculously unsuitable for that, in contrast to re2c).

One thing I didn't quite realize is how much context-free grammars are
like code... they need refactoring, testing, and have performance
implications.  IMO the POSIX grammar is not really suitable for source
code, although bash seems to have done that...

ANTLR actually does a nice lookahead analysis on the grammar, which
can help performance. And it's a good tool because it uses top down
parsing algorithms, which fits the shell a lot better than bottom up
parsing like Yacc.

I'm close to the point of porting this debugged grammar to a recursive
descent parser, which will then be able to accomodate all the special
cases that POSIX lays out.  (I am trying to avoid the asymptotic
approach to correctness that some shells seem to use...)

I also wrote a little shell test framework and have been torturing the
nooks and crannies of bash and dash.  I found one place where bash is
not POSIX conformant and dash is, and some other differences.  All in
all I think the POSIX grammar is pretty good and helpful, although it
only covers a single sublanguage out of the 3 or 4 that are in the
shell.

So, the implementation obviously isn't going to be in toybox directly,
but it could be used for Aboriginal.  That's still in the future
though, since I've only implemented a very basic runtime.  It executes
basic commands, but nothing else.  95% of the work has been
lexing/parsing so far.

I will have to write something up, because it seems there hasn't been
a modern implementation of shell in awhile.  Most implementations use
gotos, global variables, and macros a lot -- and not in a good way.
For example, bash's function/macros to get the next token is somewhat
ridiculous, and ash/dash use some horrible goto tricks as well.  The
mksh lexer seems somewhat principled although I didn't follow it all.

Andy

 1460242711.0