[Toybox] awk (Re: ps down, top to go)

Rob Landley rob at landley.net
Mon May 9 01:13:02 PDT 2016


On 05/08/2016 10:54 PM, Andy Chu wrote:
>> Well, way back when I tried to make sense of busybox's awk
>> implementation, which is around 3000 lines of C.
>>
>> More recently, I read about half the posix awk description and dug up a
>> copy of the original "The AWK Programming Language" book by Aho,
>> Kernighan, and Weinberger from 1988, which I've read the introduction of.
> 
> I have that book, and skimmed it again a few weeks ago.  It's actually
> a real gem that goes beyond awk into computer science -- they have a
> nice mini-Make implementation in awk, as well as topological
> sorting and recursive descent parsing, etc.  (FWIW, Programming in Lua
> by the Lua authors is another book in that same vein -- it's great
> even if like me you're not that keen on Lua the language.)

Oh I was quite impressed with Lua, but all programming languages operate
within a framework and Lua intentionally doesn't provide a usable
standard framework.

In C the framework is libc plus your OS-specific context (of which posix
tries to be the common subset of unix systems and unix-derived systems).
You can add MORE libraries to that (libx, libz, etc), but you GET libc
more or less for free, in the language spec. The assumption is you have
an OS with system calls in the default case.

Javascript defaulted to running in the context of a web page within a
browser, then node.js provided an application context. Emacs lithp
operates on a text buffer within the editor. PHP operates within a web
server context. Java had both applet context for being a sandboxed web
browser plugin, and application context for more or less posix bindings
(except they wanted to cram threads down everybody's throat so refused
to provide poll/select bindings for the longest time).

Lua refused to provide _any_ standard framework. They expected to be
embedded in another program, with C bindings, so a framework to make the
sucker usable is left as an exercise for the reader. A decade+ after the
language shipped, a Posix library sort of became a de facto standard,
but it's incomplete (when I looked into "rewriting busybox in lua" years
ago I found I'd need to install 7 prerequisite packages just to have the
bindings to do ifconfig and netcat and grep and so on, and that's just
silly) and it's not installed by default when you install lua.

> The 8K LOC Kernighan Awk is apparently the one described in this book.
> But I wonder how far that is from common practice?  If busybox awk can
> do it in 3K LOC (+ libs and regex), then that's VERY encouraging.  I
> thought it was going to be closer to the 21K LOC mawk.

3K is a _lot_. The top 10 commands outside the "pending" directory
according to wc -l are:

    383 toys/lsb/mount.c
    400 toys/posix/sort.c
    423 toys/posix/patch.c
    495 toys/posix/cp.c
    517 toys/other/ifconfig.c
    564 toys/posix/find.c
    585 toys/posix/ls.c
    723 toys/other/bzcat.c
   1081 toys/posix/sed.c
   1683 toys/posix/ps.c

The only reason ps is ~1700 lines is it implements 5 different commands
(ps, top, iotop, pgrep, pkill), _and_ I haven't figured out a clean way
to factor out the common infrastructure (at least half those lines) into
lib yet. (I also note that lib/*.c is 3653 lines total; that plus
another 230 lines of main.c is basically all our shared C infrastructure.)

If awk's much bigger than sed, I'm probably doing it wrong.

> busybox awk looks like a pretty straightforward interpreter
> architecture from what I can tell -- lex, parse, walk a tree to
> execute, and runtime support with hash tables and so forth.

Possibly awk and sh can share parser infrastructure. Not sure yet.

> next_token() is a little longer than I'd like at 200+ lines, but
> that's nothing compared to 1000+ line lexing functions in mksh or bash
> (and I'm not even counting the serious macro usage).  Everything about
> shell is harder than awk -- lexing, parsing, execution, interactivity,
> etc.  (An exception might be the number of dialects; shell does seem
> to be pretty highly standardized and the implementations are
> conformant.)

There's a posix spec. I'll probably implement at least most of that on
general principles, even before worrying too much about real world test
cases. (I say that not having read it recently/thoroughly though. But
since I don't regularly use awk, I don't have the domain expertise to
say what it SHOULDN'T have...)

>>> From my research, it seems like a significantly easier problem than the
>>> shell.
>>
>> Yeah, probably. The shell is actually about 30 commands integrated
>> together, several of which are literal commands and several of which
>> are implicit: "expand this environment variable", which could be
>> $RANDOM or $PPID or $SECONDS which actually invokes a function, but
>> you still need to support ${#SECONDS} or ${RANDOM:1:3}. Some of them
>> literally are external commands, where [ is "test" and : is "true"
>> and "help" I already did, and I already did ulimit. Then $(( )) is
>> kinda like expr but not quite, "trap" is its own thing, "read" is
>> actually fairly elaborate, command history navigation (and "history
>> expansion", which I've never personally used)... Job control is a
>> whole subsystem (and pipes and redirection are integrated into that;
>> you suspend a _pipeline_ not a process, and kill needs job control
>> integration when run from toysh). $PWD != abspath, although it looks
>> like getcwd() returns what we need there, but I need to adjust cd to
>> strip ".." directory entries from the path instead of actually
>> traversing the filesystem (but only for _leading_ .. in the path, I
>> think? Need to test). There's all that loop and test logic, shell
>> functions and alias, pushd/popd/dirs, I have NEVER understood what
>> the "getopts" command is for but need to try again, don't get me
>> STARTED on the dozens of different things "set" does let alone
>> "set -o"...
> 
> Right, there is an entire sublanguage within ${} and  $(()), and those
> sublanguages have to be intertwined with the main language in parsing
> and execution.
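
(Case in point: $SECONDS is really a function in variable's clothing,
but the ${} operators still apply to whatever it returns:

  $ bash -c 'sleep 3; echo $SECONDS ${#SECONDS} ${SECONDS:0:1}'
  3 1 3

modulo timing, of course.)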

I've done most of the research (twice now!), I just haven't sat down to
plow through the implementation because I'm closing tabs at the moment
rather than opening new ones.

There's like 6 commands that need to get cleaned up and promoted out of
pending, and I'm halfway through lib/linestack.c (now that top.c is
putting some weight on it, I should do "less" and "vi" while it's
still soft and malleable).

> I think I have all the language stuff under control.  The hardest part
> is parsing $() because you have to run the entire parser to know where
> the ending ) is.

Context stack.

> Right paren is used in four places:
> 
> 1. subshell -- ( echo hi )
> 2. command sub $(echo hi) -- this is parsed MUCH differently than
> subshell, because it's at the intra-word level and not the outer word
> level, e.g. 'foo'$(echo hi)"bar"$((1+1)) is all a single shell word,
> but you can't put a subshell in there.
> 3. case foo)  -- bash has bugs with putting this in a subshell.

What is and isn't a bug is... It took me a while to figure out why this
works:

  for i in a b c; do echo $i; done

But this is a syntax error even though I can put a newline after the do:

  for i in a b c; do; echo $i; done
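
Whereas the newline version works fine:

  $ for i in a b c; do
  > echo $i; done
  a
  b
  c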

(Answer: do/done takes a block between them that turns into a nameless
function, and if you start a command with ";" the shell bitches hugely
for no obvious reason, while newline is just whitespace and thus
ignored. I wonder if posix requires it to complain about leading ;? I
probably looked this up at some point, but don't remember the answer off
the top of my head...)

> To make it more exciting, case (foo) is also valid in the POSIX grammar
> (and all implementations).
> 4. func() { }

1) is basically ( as a command (it's a context shift command like if or
while, but it's a command, same block definition as above; see also {
and } blocks).

2) happens during environment variable parsing (the _fun_ bit is the
quoting in "$(echo "$(ls -l)")")

3) case/esac is another "implicit function" thingy like do/done or
then/fi. It has to be parsed as its own context shift on the context
stack (demo below).

4) is basically parentheses as the first argument of a command, which
requires a { } block the way if requires a then/fi block.
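
Demonstrating 2) and 3): quotes re-nest inside $() instead of
terminating the outer string, and the leading paren in a case pattern
works:

  $ echo "outer $(echo "inner $(echo "innermost")")"
  outer inner innermost

  $ case foo in (foo) echo match;; esac
  match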

Oh, speaking of { } blocks, you can do this on the command line:

  { echo -e "one\ntwo\nthree"
  } | tac

But if you don't have the line break in there the } is considered an
argument to echo and you get a prompt for continuation until you feed it
} at the start of a line. You can use a ; instead of a newline though,
that's "start of a line" enough.

See also:

  { echo -e "one\ntwo\nthree" ; sleep 3; echo hello; } &

> Since anything can appear within $(), all those right parens have to
> be disambiguated.

No, you just parse recursively (or iteratively with a context stack).

> There is a similar problem with ${} overloading --
> it's used for anonymous blocks and brace expansion, in addition to var
> expansion.  I found bash bugs here too.

Such as...?

  echo ${USER:$((${#USER}-4))} ${BLAH:-${THINGY:-fred}}
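
(With USER=landley and BLAH/THINGY unset, that prints "dley fred": the
last four characters of $USER, then the nested fallback default.)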

> I have a very nice and clean way of figuring out ) though... if
> anyone's interested I can show it.

Context stack? That was my way. Lots of this parsing needs to nest
arbitrarily deep, and it can cross lines:

  $ echo ${hello:-
  > there}
  there

And if you put a double quote before the $ and after the } you get a
newline before there. If you don't, command line argument parsing and
reblocking strips it.
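
I.e.:

  $ echo "${hello:-
  > there}"

  there

(Note the leading blank line in the output.)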

What do I mean by reblocking? I mean this:

  $ printf "one %s three %s\n" ${hello:-two four}
  one two three four

Seriously, I've researched this at some length. You have to have the
output of things feeding into the input of other things in the right order.

> The bash aosabook chapter which I've referred to several times talks
> about how they had to duplicate a lot of the parser to handle this,
> and it still isn't right:

I'm not looking at bash's implementation, I'm looking at the spec and
what it does when I feed it lots of test cases (what inputs produce what
outputs).

Years ago I was trying to get it to preserve NUL bytes in the output of
$(blah) arguments, but the reason that can't work is the way C feeds
argv[] to child programs (NUL terminated strings, so an embedded NUL
truncates the argument). But I explored enough to figure out what the
ISSUE was, anyway...

> "Much of the work to recognize the end of the command substitution
> during the parsing stage is encapsulated into a single function
> (parse_comsub), which knows an uncomfortable amount of shell syntax
> and duplicates rather more of the token-reading code than is optimal.
> This function has to know about here documents, shell comments,
> metacharacters and word boundaries, quoting, and when reserved words
> are acceptable (so it knows when it's in a case statement); it took a
> while to get that right."
> 
> http://aosabook.org/en/bash.html

I have a policy of not caring how gnu actually implemented anything in a
way that isn't at least visible to strace. I do NOT look at their source
code, on principle.

(I can't help it maintaining lastgplv2 gcc and binutils forks, but for
stuff like sed? Nope. Still haven't.)

> The reason I can do better essentially boils down to the fact that I'm
> using a top-down rather than bottom-up parser (yacc).
> 
> "getopts" is simply a binding for libc getopt().  It just looks weird
> because sh is a weird programming language.  But it's basically a 1 to
> 1 mapping.  You provide a spec string like "a:bcd:" and then you loop
> over global variables to fill out your own flags struct/global
> variables.

Toybox doesn't use libc getopt(), we use lib/args.c (which does not use
libc getopt), so what you decide to do in your shell and what it makes
sense for toysh to do may not be related to each other here.

This is one reason that toybox never permutes its environment string
(which screws up what ps and top see). Our toys.optargs[] is a second
argv[] array, a new malloc() pointing to the environment string data.
The original argv[] is still there with the unmodified contents
(accessible as toys.argv[] if you need it).
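
(For reference, a minimal sketch of the getopts idiom Andy's
describing; -a takes an argument, -b is a bare flag:

  while getopts "a:b" opt; do
    case $opt in
      a) echo "-a = $OPTARG" ;;
      b) echo "-b set" ;;
      *) echo "usage: $0 [-a arg] [-b]" >&2; exit 1 ;;
    esac
  done
  shift $((OPTIND-1))
  echo "remaining: $@"

Run with "-a hello -b foo" that prints "-a = hello", "-b set", and
"remaining: foo".)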

> The main part I don't have figured out is interactive job control.  I
> don't really ever use that, since I use tmux.  I have interactive
> parsing and autocompletion figured out though (with prototype code.)

I learned how to do all the job control by studying lash way back when.
(Erik Andersen's "Lame-Ass SHell". :)

> Shell is definitely hard because it's big, and there's a lot of
> overlapping and interacting features.  But I think just running in
> batch mode is a huge step forward because you can test it with builds,
> and you can be confident you have all the parsing solved.

It's not that hard, it's just time consuming.

Keep in mind, over the years people have written a dozen different
shells. It's really not that big a deal, I just want to do it _right_ so
I'm trying to reserve a large block of time so that once I start I can
finish rather than getting repeatedly interrupted. And that means
knocking down a bunch of smaller todo items first.

(That said, $DAYJOB can use this and I might have to start sooner
anyway. Right now they're using busybox hush for a prototype that can't
have any GPL code when it ships. Then again they can always port netbsd
ash to linux if it comes to that...)

> Though I made sure that my parser works both in batch mode and
> interactively.  Interactive parsing is hard because of $PS2!!!  $PS2
> is when you get > after an incomplete command.  That interacts with
> quoting, here docs, compound commands, etc.

Nah, that's basically a command history problem. The subsystem that
handles interactive input worries about that, the shell doesn't
particularly care.

The shell takes both interactive and non-interactive input. If you
"echo ls -l | sh" you're not running interactively. Here documents
aren't running interactively. "sh filename.sh" isn't running interactively.

Note that "interactive" and "terminal" are not the same thing, you can
be interactive without a terminal (init=/bin/sh then /dev/console
doesn't provide one) and you can have a terminal but not be interactive
(three examples above where the input commands come from is not a
terminal device and thus don't $PS1, but if the shell script you're
executing calls "read" then it needs to read from stdin or the tty.

> This is barely mentioned
> in the POSIX spec but every implementation I tried does it right

Bash man page has a thing on it. There's a PS3 and PS4 too, but I can
probably hold off on caring about that until somebody complains about
select and trace.
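
($PS3 is the select prompt and $PS4 is the "set -x" trace prefix:

  $ set -x; echo hi
  + echo hi
  hi

which is why the default PS4 is "+ ".)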

> (except for a tiny bug in dash I found.)

Oh don't get me started on the Defective Annoying SHell. Try this:

  while true; do echo echo hello; sleep 1; done | sh

That doesn't hang forever: it prints each hello as it gets the command,
so clearly it's not reading to EOF before launching the script. But
then if you do this:

  $ echo -e "read x\nhello\n echo \$x" | sh
  sh: 2: hello: not found

Because the parser buffered past the hello, and then "read" bypasses
the stdin FILE * buffer and gets EOF on fd 0. If you use "bash" instead
of "sh" on ubuntu, it works fine and you get "hello". Yes, I care about
getting this right.
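
Same test with bash:

  $ echo -e "read x\nhello\n echo \$x" | bash
  hello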

(Seriously, I've been studying this on and off from a "how do I
implement my own version of this" perspective since 2006.)

> Andy

Rob


