[Toybox] awk (Re: ps down, top to go)

Sun May 8 20:54:44 PDT 2016

> Well, way back when I tried to make sense of busybox's awk
> implementation, which is around 3000 lines of C.
>
> More recently, I read about half the posix awk description and dug up a
> copy of the original "The AWK Programming Language" book by Aho,
> Kernighan, and Weinberger from 1988, which I've read the introduction of.

I have that book, and skimmed it again a few weeks ago.  It's actually
a real gem that goes beyond awk into computer science -- they have a
nice mini-Make implementation in awk, and as well as topological
sorting and recursive descent parsing, etc.  (FWIW, Programming in Lua
by the Lua authors is another book in the same that vein -- it's great
 even if like me you're not that keen on Lua the language.)

The 8K LOC Kernighan Awk is apparently the one described in this book.
But I wonder how far that is from common practice?  If busybox awk can
do it in 3K LOC (+ libs and regex), then that's VERY encouraging.  I
thought it was going to be closer to the 21K LOC mawk.

busybox awk looks like a pretty straightforward interpreter
architecture from what I can tell -- lex, parse, walk a tree to
execute, and runtime support with hash tables and so forth.

next_token() is a little longer than I'd like at 200+ lines, but
that's nothing compared to 1000+ line lexing functions in mksh or bash
(and I'm not even counting the serious macro usage).  Everything about
shell is harder than awk -- lexing, parsing, execution, interactivity,
etc.  (An exception might be the number of dialects; shell does seem
to be pretty highly standardized and the implementations are
conformant.)

>> The lexer is 582 lines of clean looking C code (it's Kernighan, so I guess
>> we all know his style :) ), which is not insignificant!
>
> I'm not adding yacc as a build dependency.

AFAICT, the lexer in bwk doesn't really depend on Yacc.  It does
include ytab.h and implement a function called yylex(), but I don't
think there should be any problem just renaming that function and
calling it from your own code.

Lexers are generally easier to reuse than parsers because there are
only a limited number of ways to write them, and they do something
pretty well-defined.  They also seem worth reusing because there is a
lot of hairy logic and corner cases.

But yeah it seems like a BSD license, not a public domain license in
any case.  Though Dennis seems to think it's compatible with your
philosophy.

>>>From my research, it seems like a significant easier problem than the
>> shell.
>
> Yeah, probably. The shell is actually about 30 commands integrated
> together, several of which are literal commands and several of which are
> implicit "expand this environment variable, which could be $RANDOM or
> $PPID or $SECONDS which actually invokes a function but you still need
> to support ${#SECONDS} or ${RANDOM:1:3}. Some of them literally are
> external commands where [ is "test" and : is "true" and "help" I already
> did, and I already did ulimit, some are $(( )) is kinda like expr but
> not quite, "trap" is its own thing, "read" is actually fairly elaborate,
> command history navigation (and "history expansion" which I've never
> personally used)... Job control is a whole subsystem (and pipes and
> redirection are integrated into that; you suspend a _pipeline_ not a
> process, and kill needs job control integration when run from toysh).
> $PWD != abspath although it looks like getcwd() returns what we need
> there, but I need to adjust cd to strip directory entries instead ofa
> ctually traversing the filesystem .. (but only for _leading_ .. in the
> path, I think? Need to test). There's all that loop and test logic,
> shell functions and alias, pushd/popd/dirs, I have NEVER understood what
> the "getopts" command is for but need to try again, don't get me STARTED
> on the dozens of different things "set" does let alone "set -o"...

Right, there is an entire sublanguage within ${} and  $(()), and those
sublanguages have to be intertwined with the main language in parsing
and execution.

I think I have all the language stuff under control.  The hardest part
is parsing $() because you have run the entire parser and know where
the ending ) is.  Right paren is used in four places:

1. subshell -- ( echo hi )
2. command sub $(echo hi) -- this is parsed MUCH differently than
subshell, because it's at the intra-word level and not the outer word
level, e.g.  'foo'$(echo hi)'"bar"$((1+1)) is all a single shell word,
but you can't put a subshell in there.
3. case foo)  -- bash has bugs with putting this in a subshell.  To
make it more exciting, case (foo) is also valid in the POSIX grammar
(and all implementations).
4. func() { }

Since anything can appear within $(), all those right parens have to
be disambiguated.  There is a similar problem with ${} overloading --
it's used for anonymous blocks and brace expansion, in addition to var
expansion.  I found bash bugs here too.

I have a very nice and clean way of figuring out ) though... if
anyone's interested I can show it.

The bash aosabook chapter which I've referred to several times talks
about how they had to duplicate a lot of the parser to handle this,
and it still isn't right:

"Much of the work to recognize the end of the command substitution
during the parsing stage is encapsulated into a single function
(parse_comsub), which knows an uncomfortable amount of shell syntax
and duplicates rather more of the token-reading code than is optimal.
This function has to know about here documents, shell comments,
metacharacters and word boundaries, quoting, and when reserved words
are acceptable (so it knows when it's in a case statement); it took a
while to get that right."

http://aosabook.org/en/bash.html

The reason I can do better essentially boils down to the fact that I'm
using a top-down rather than bottom-up parser (yacc).

"getopts" is simply a binding for libc getopt().  It just looks weird
because sh is a weird programming language.  But it's basically a 1 to
1 mapping.  You provide a spec string like "a:bcd:" and then you loop
over global variables to fill out your own flags struct/global
variables.

The main part I don't have figured out is interactive job control.  I
don't really ever use that, since I use tmux.  I have interactive
parsing and autocompletion figured out though (with prototype code.)

Shell is definitely hard because it's big, and there's a lot of
overlapping and interacting features.  But I think just running in
batch mode is a huge step forward because you can test it with builds,
and you can be confident you have all the parsing solved.

Though I made sure that my parser works both in batch mode and
interactively.  Interactive parsing is hard because of $PS2!!!  $PS2
is when you get > after an incomplete command.  That interacts with
quoting, here docs, compound commands, etc.  This is barely mentioned
in the POSIX spec but every implementation I tried does it right
(except for a tiny bug in dash I found.)

Andy