[Toybox] awk (Re: ps down, top to go)

Andy Chu andychup at gmail.com
Tue May 10 23:41:27 PDT 2016


> Oh I was quite impressed with Lua, but all programming languages operate
> within a framework and Lua intentionally doesn't provide a usable
> standard framework.

The way I think of it is that Lua doesn't provide the program with any
"capabilities" by default (in the security sense).  You have to
explicitly grant capabilities by providing hooks to your application.

This is actually one of the things that attracted me to it, since
having a secure environment opens up some interesting possibilities
with executing remote code (like JavaScript).

Tcl has a similar embedded language design philosophy, but it happened
to come with GUI libraries and such, which made it popular for a while.

I don't think Lua "refused" to provide a standard library... people
were mostly using it for games and embedded applications, and there
just wasn't a strong enough community running it on POSIX or whatever.
It was just 1 or 2 academics who wrote all the code -- they never had
a public source repo or accepted patches.


>> busybox awk looks like a pretty straightforward interpreter
>> architecture from what I can tell -- lex, parse, walk a tree to
>> execute, and runtime support with hash tables and so forth.
>
> Possibly awk and sh can share parser infrastructure. Not sure yet.

One thing to note is that they use opposite parsing algorithms:

* sh: All implementations except bash use a hand-written recursive
descent parser, i.e. top down parsing; whereas bash uses yacc, i.e.
bottom up parsing.  And bash regrets the choice.

* awk: All implementations except busybox awk use yacc (bottom up).
It's not entirely clear to me what algorithm busybox awk is using; I
think it's a hand-written bottom-up parser.  It certainly doesn't look
like recursive descent.

The difference arises from the language itself.  The main sh language
has no expressions and hence no left recursion; it's essentially LL(1)
(except for looking ahead to find the ( in a function def).

Awk has TWO expression languages -- the conditions can be combined
with boolean logic (e.g. $1 == "foo" && $2 == "bar"), and the
procedural action language has arithmetic.  So bottom up parsing works
better here.
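
For example, both sublanguages show up in a one-liner like this -- a
pattern with boolean operators and an action with arithmetic (made-up
data fed in via printf):

$ printf 'foo bar 3\nfoo baz 4\n' |
>   awk '$1 == "foo" && $2 == "bar" { t += $3 * 2 } END { print t }'
6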

> What is and isn't a bug is... It took me a while to figure out why this
> works:
>
>   for i in a b c; do echo $i; done
>
> But this is a syntax error even though I can put a newline after the do:
>
>   for i in a b c; do; echo $i; done

The shell syntax is definitely weird at first, but this distinction
follows directly from the POSIX grammar -- which, as I mentioned, is
accurate in the sense that all the implementations I tested conform
to it closely.  (The exception is bash, which doesn't allow unbraced
single-command function definitions.  Try "func() ls /; func" in bash
and dash; according to the grammar, dash is correct.)

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_10

The relevant productions are:

for_clause       : ... For name linebreak in wordlist sequential_sep do_group

do_group         : Do compound_list Done           /* Apply rule 6 */

compound_list    :              term
                 | newline_list term
                 |              term separator
                 | newline_list term separator


compound_list can start with a list of newlines, but it can't start
with semicolons.  That's why you can have newlines after "do" but not
a semicolon.
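
You can see the asymmetry interactively:

$ for i in a b; do
> echo $i
> done
a
b

$ for i in a b; do; echo $i; done
bash: syntax error near unexpected token `;'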



> 1) is basically ( as a command (it's a context shift command like if or
> while, but it's a command, same block definition as above; see also {
> and } blocks).
>
> 2) happens during environment variable parsing (the _fun_ bit is the
> quoting in "$(echo "$(ls -l)")")

In my parser, there's nothing special about a command sub surrounded
by double quotes surrounded by a command sub surrounded by double
quotes.  That's all handled straightforwardly by the recursion (ditto
for evaluating the expression).  However, detecting the ) that matches
a command sub is not so straightforward, since there are 4 uses of ).
It does involve a stack in the lexer; it's debatable whether "context
stack" describes it.

> Oh, speaking of { } blocks, you can do this on the command line:
>
>   { echo -e "one\ntwo\nthree"
>   } | tac
>
> But if you don't have the line break in there the } is considered an
> argument to echo and you get a prompt for continuation until you feed it
> } on the start of a line. You can use a ; instead of a newline though,
> that's "start of a line" enough.

Right, this is because { and } are "reserved words", while ( and )
are operators.  A reserved word has to be delimited by spaces, whereas
an operator delimits itself.  Reserved words are only special if they
are the FIRST word, so echo } doesn't need to be quoted, but echo )
does.

(echo hi)     # valid without spaces
{echo hi}     # not what you think
{ echo hi }   # not what you think either
{ echo hi; }  # correct: ; is an operator, and } is then the first
              # word of the next command
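
Interactively (the third form just sits at the PS2 prompt until }
shows up as the first word):

$ {echo hi}
bash: {echo: command not found
$ { echo hi }
> }
hi }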


>> There is a similar problem with ${} overloading --
>> it's used for anonymous blocks and brace expansion, in addition to var
>> expansion.  I found bash bugs here too.
>
> Such as...?

The test case I came up with is:

$ echo ${foo:-$({ which ls; })}
-bash: syntax error near unexpected token `)'

$ dash
$ echo ${foo:-$({ which ls; })}
/bin/ls

This is a command sub with a braced block inside it, as the default
value inside ${}.  Bash gets confused about the matching }.  Something
like ${foo:-${bar}} should work fine though.
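
For example, with foo and bar both unset:

$ echo ${foo:-${bar:-default}}
default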


> Context stack? That was my way. Lots of this parsing needs to nest
> arbitrarily deep, and it can cross lines:
>
>   $ echo ${hello:-
>   > there}
>   there

Right, this is the PS2 problem.  When you hit enter, do you execute
the command, or print > and continue parsing?

Actually this case is broken in dash -- try "echo ${ <newline>" in
bash and dash.  (Although I'm sure nobody really cares.)


> And if you put a double quote before the $ and after the } you get a
> newline before there. If you don't, command line argument parsing and
> reblocking strips it.
>
> What do I mean by reblocking? I mean this:
>
>   $ printf "one %s three %s\n" ${hello:-two four}
>   one two three four

I don't see anything special about this; it's a straightforward
consequence of word splitting.  Because there are no quotes around
${hello...}, its value is subject to word splitting, so there are two
arguments to printf.  Quotes change the behavior as you would expect;
now there is one argument to printf:

printf "one %s three %s\n" "${hello:-two four}"
one two four three

(with the last %s expanding to empty)
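
An easy way to see the argument count directly (count here is just a
throwaway helper, with hello unset as before):

$ count() { echo $#; }
$ count ${hello:-two four}
2
$ count "${hello:-two four}"
1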


>> The bash aosabook chapter which I've referred to several times talks
>> about how they had to duplicate a lot of the parser to handle this,
>> and it still isn't right:
>
> I'm not looking at bash's implementation, I'm looking at the spec and
> what it does when I feed it lots of test cases (what inputs produce what
> outputs).

You apparently have a love-hate relationship with bash.  You
explicitly said you want to write bash and not just sh, yet you don't
want to look at how it implements anything :)

> Toybox doesn't use libc getopt(), we use lib/args.c (which does not use
> libc getopt), so what you decide to do in your shell and what it makes
> sense for toysh to do may not be related to each other here.

Sure, I'm just describing what it does.  I agree getopts is an awkward
interface in sh, but if you want a POSIX shell, much less a bash
clone, you need it.


> Keep in mind, over the years people have written a dozen different
> shells. It's really not that big a deal, I just want to do it _right_ so
> I'm trying to reserve a large block of time so that once I start I can
> finish rather than getting repeatedly interrupted. And that means
> knocking down a bunch of smaller todo items first.

I definitely agree that you want a big block of uninterrupted time.
(I've been off work since March so I've got that going for me.)

It's not clear to me that any reasonably popular shell was started
later than 1990 or so (is zsh the latest?).  I think the BSDs are
using code started 40+ years ago.  I don't know when mksh is from, but
I think it must be that old too.

As I mentioned, my goal isn't to simply implement sh, because that's
been done.  It seems to me that 25 years is a good interval to have
some innovation in the shell.  I'm just starting with sh so it's a
superset of what is known to work, and so people actually have a
migration path.


>> Though I made sure that my parser works both in batch mode and
>> interactively.  Interactive parsing is hard because of $PS2!!!  $PS2
>> is when you get > after an incomplete command.  That interacts with
>> quoting, here docs, compound commands, etc.
>
> Nah, that's basically a command history problem. The subsystem that
> handles interactive input worries about that, the shell doesn't
> particularly care.

Not sure I see what you're saying here.  My point was that line
continuation with PS2 happens at many levels: the here doc level
("pre-lexical"), the lexical level with \<newline> or '', the word
level with "" or ${} or $() or $(()), and the command level with
for/while/if/case, etc.  The language has one or two more layers than
something like C or Awk.
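
Concretely, each of these incomplete commands lands an interactive
shell at the PS2 prompt via a different layer (here doc, lexical
\<newline>, quoting, and compound command, respectively):

$ cat <<EOF
> in the here doc
> EOF
in the here doc
$ echo one \
> two
one two
$ echo "a
> b"
a
b
$ for i in 1 2; do
> echo $i
> done
1
2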

> Oh don't get me started on the Defective Annoying SHell. Try this:
>
>   while true; do echo echo hello; sleep 1; done | sh
>
> Does not hang forever, it prints each hello as it gets the command. So
> clearly it's not reading to EOF before launching the script. But then if
> you do this:
>
>   $ echo -e "read x\nhello\n echo \$x" | sh
>   sh: 2: hello: not found
>
> Because it read past the hello and then read is bypassing the stdin FILE
> * buffer and getting the EOF on fd 0. If you use "bash" instead of "sh"
> on ubuntu, it works fine and you get "hello". Yes, I care about getting
> this right.

The aosabook chapter mentions this case of getting both code AND data
from stdin:

http://aosabook.org/en/bash.html

"When the shell is not using readline, it uses either stdio or its own
buffered input routines to obtain input. The bash buffered input
package is preferable to stdio when the shell is not interactive
because of the somewhat peculiar restrictions Posix imposes on input
consumption: the shell must consume only the input necessary to parse
a command and leave the rest for executed programs. This is
particularly important when the shell is reading a script from the
standard input. The shell is allowed to buffer input as much as it
wants, as long as it is able to roll the file offset back to just
after the last character the parser consumes. As a practical matter,
this means that the shell must read scripts a character at a time when
reading from non-seekable devices such as pipes, but may buffer as
many characters as it likes when reading from files."

Do you know where in POSIX it "imposes restrictions on input
consumption"?  I didn't see it.

I don't know if I like the idea of reading one byte at a time as a
special case, JUST for this.  Where does it come up?  Have you seen
real scripts which read both code and data simultaneously from stdin?
Since I have a polymorphic readline() reader and a file reader, I
could probably handle this as a PipeReader implementing the same
interface, but I would want to be convinced that it's not wasted code.
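
For what it's worth, the pattern in its simplest form looks like this
(contrived; per the discussion above, bash gets it right where dash
doesn't):

$ printf 'read greeting\nhello\necho "$greeting world"\n' | bash
hello world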

Andy


