[Toybox] awk (Re: ps down, top to go)

Sun May 22 11:23:53 PDT 2016

On 05/11/2016 01:41 AM, Andy Chu wrote:
>> Oh I was quite impressed with Lua, but all programming languages operate
>> within a framework and Lua intentionally doesn't provide a usable
>> standard framework.
> 
> The way I think of it is that Lua doesn't provide the program with any
> "capabilities" by default (in the security sense).  You have to
> explicitly grant capabilities by providing hooks to your application.

Providing write() but not printf(), or + - * / but no math library with
trig functions, has nothing to do with security.

The X11 problem was always "Here's a window and line drawing primitives.
Creating a toolkit for buttons and sliders and pulldown menus and such
is left as an exercise to the user, there's no standard one provided and
12 non-standard ones which all suck".

Hence qt vs gtk. They don't let you do anything you couldn't without
them, they just save you writing giant piles of code yourself.

> This is actually one of the things that attracted me to it, since
> having a secure environment opens up some interesting possibilities
> with executing remote code (like JavaScript).

The most secure system is powered off, ground into a fine powder, mixed
with acid, encased in concrete, and dropped into a deep sea trench.
Ideally in a way that the acid will eat through the concrete and
dissolve the whole mess into the ocean near the bottom. (And that's
assuming you haven't got the budget to fire it into the sun and closely
monitor its entire trip there.)

> Tcl has a similar embedded language design philosophy, but it happened
> to come with GUI libraries and such which made it popular for a while.
> 
> I don't think Lua "refused" to provide a standard library... people
> were mostly using it for games and embedded applications, and there
> just wasn't a strong enough community running it on POSIX or whatever.
>
> It was just 1 or 2 academics who wrote all the code -- they never had
> a public source repo or accepted patches.

I was under the impression it had a vigorous community doing stuff for a
decade before anybody who spoke English noticed, because they were doing
it in Portuguese.

Practical result's the same either way.

>>> busybox awk looks like a pretty straightforward interpreter
>>> architecture from what I can tell -- lex, parse, walk a tree to
>>> execute, and runtime support with hash tables and so forth.
>>
>> Possibly awk and sh can share parser infrastructure. Not sure yet.
> 
> One thing to note is that they use opposite parsing algorithms:
> 
> * sh: All implementations except bash use a hand-written recursive
> descent parser, i.e. top down parsing; whereas bash uses yacc, i.e.
> bottom up parsing.  And bash regrets the choice.

I wasn't planning to use yacc.

> * awk: All implementations except busybox awk use yacc (bottom up).

I wasn't planning to use yacc here either.

> It's not entirely clear to me what algorithm busybox awk is using; I
> think it is a hand-written bottom up parser.  Doesn't look like
> recursive descent for sure.

My limiting factor with awk is I need to collect a large corpus of awk
test scripts so I know what success looks like.

> The difference arises from the language itself.  The main sh language
> has no expressions and hence no left recursion; it's essentially LL(1)
> (except for looking ahead to find the ( in a function def).

You can recurse, you can throw stuff on a stack. Not a big deal either way.

No man page for ll or LL. When I type "ll" Ubuntu has it as an alias for
ls -l (so no prompt for a package to install). And LL says command not
found (again, no prompt for a package to install).

> Awk has TWO expression languages -- the conditions can be combined
> with boolean logic (e.g. $1 == "foo" && $2 == "bar"), and the
> procedural action language has arithmetic.  So bottom up parsing works
> better here.

Don't care.

>> What is and isn't a bug is... It took me a while to figure out why this
>> works:
>>
>>   for i in a b c; do echo $i; done
>>
>> But this is a syntax error even though I can put a newline after the do:
>>
>>   for i in a b c; do; echo $i; done
> 
> The shell syntax is definitely weird at first, but this distinction
> follows directly from the POSIX grammar -- which I mentioned is
> accurate in the sense that all the implementations I tested are very
> conformant.  (The exception is bash which doesn't allow unbraced
> single command function definitions.  Try "func() ls /; func" in bash
> and dash; according to the grammar, dash is correct.)
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_10

No.

Any time "bash is wrong but dash is correct", posix is wrong. Posix is
saying that the de-facto Linux shell got this wrong for almost 20 years
and nobody noticed, then a shell that I could trivially segfault when
Ubuntu first swapped /bin/sh for it, and in which "sleep 100 &" followed
by ctrl-c at the prompt would kill the backgrounded sleep... That was
doing it "right".

No. No it wasn't. Posix was at _best_ irrelevant.

> The relevant productions are:
> 
> ... For name linebreak in wordlist sequential_sep do_group
> 
> do_group         : Do compound_list Done           /* Apply rule 6 */
> 
> compound_list    :              term
>                  | newline_list term
>                  |              term separator
>                  | newline_list term separator
> 
> 
> compound_list can start with a list of newlines, but it can't start
> with semicolons.  That's why you can have newlines after "do" but not
> a semicolon.

I think I'm allowing the semicolon in mine, because there's no obvious
reason not to.
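
For anyone who wants to see the distinction on their own system, here's
a minimal sketch. It assumes the "sh" on your PATH is a conventional
shell such as bash or dash, and the variable names (ok, semi) are just
for illustration. A shell that chooses to accept the semicolon would
report "accepted" instead.

```shell
# Newline after "do": allowed, since compound_list may start with newlines.
ok=$(sh -c 'for i in a b c; do
echo $i; done')

# Semicolon right after "do": a syntax error in shells that follow the grammar.
if sh -c 'for i in a b c; do; echo $i; done' >/dev/null 2>&1; then
  semi=accepted
else
  semi=rejected
fi

printf '%s\n' "$ok" "$semi"
```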

>> 1) is basically ( as a command (it's a context shift command like if or
>> while, but it's a command, same block definition as above; see also {
>> and } blocks).
>>
>> 2) happens during environment variable parsing (the _fun_ bit is the
>> quoting in "$(echo "$(ls -l)")")
> 
> In my parser, there's nothing special about a command sub surrounded
> by double quotes surrounded by a command sub surrounded by double
> quotes.  That's all handled straightforwardly by the recursion (ditto
> for evaluating the expression).  However, detecting the ) that matches
> a command sub is not so straightforward, since there are 4 uses of ).
> It does involve a stack in the lexer; it's debatable whether "context
> stack" describes it.
> 
>> Oh, speaking of { } blocks, you can do this on the command line:
>>
>>   { echo -e "one\ntwo\nthree"
>>   } | tac
>>
>> But if you don't have the line break in there the } is considered an
>> argument to echo and you get a prompt for continuation until you feed it
>> } on the start of a line. You can use a ; instead of a newline though,
>> that's "start of a line" enough.
> 
> Right, this is because { and } are "reserved words", while ( and ) are
> operators.  A reserved word has to be delimited by space, whereas an
> operator delimits itself.  Reserved words are only special if they are
> the FIRST word, so echo } doesn't need to be quoted, but echo ) does.

I know.

> (echo hi)   # valid without spaces
> {echo hi}   # not what you think
> { echo hi }  # not what you think either
> { echo hi; }  # correct because ; is an operator, and } is the first
> word in the next command

You're explaining back at me what I said.
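
The four cases above can be checked mechanically. This is only a
sketch: it assumes "sh" is a bash- or dash-like shell, that nothing
named "{echo" exists on your PATH, and the variable names are made up
for the example.

```shell
# ( and ) are operators, so they delimit themselves without spaces.
paren=$(sh -c '(echo hi)')

# { and } are reserved words: "{echo" is just an unknown command name.
if sh -c '{echo hi}' >/dev/null 2>&1; then
  brace=ran
else
  brace=failed
fi

# With spaces and a ; before }, the block parses as intended.
block=$(sh -c '{ echo hi; }')

# } is only special as the first word of a command, so echo prints it.
arg=$(sh -c 'echo }')

printf '%s\n' "$paren" "$brace" "$block" "$arg"
```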

>>> There is a similar problem with ${} overloading --
>>> it's used for anonymous blocks and brace expansion, in addition to var
>>> expansion.  I found bash bugs here too.
>>
>> Such as...?
> 
> The test case I came up with is:
> 
> $ echo ${foo:-$({ which ls; })}
> -bash: syntax error near unexpected token `)'
> 
> $ dash
> $ echo ${foo:-$({ which ls; })}
> /bin/ls

You said they said they regret using yacc as their parser. :)

> This is a command sub with a braced block inside it, as the default
> value inside ${}.  Bash gets confused about the matching }.  Something
> like ${foo:-${bar}} should work fine though.

I just checked and echo ${blah:-"$({ ls; })"} works, which isn't hugely
surprising.

>> Context stack? That was my way. Lots of this parsing needs to nest
>> arbitrarily deep, and it can cross lines:
>>
>>   $ echo ${hello:-
>>   > there}
>>   there
> 
> Right, this is the PS2 problem.  When you hit enter, do you execute
> the command, or print > and continue parsing?

Eh, not that big a deal. My question was more whether

 $ ls ; echo ${hello:-

Should run the ls before prompting for the rest of the echo.

> Actually this case is broken in dash -- try "echo ${ <newline>" in
> bash and dash.  (Although I'm sure nobody really cares.)

I don't really care what dash does. It is defective and annoying, says
so right in the acronym.

>> And if you put a double quote before the $ and after the } you get a
>> newline before there. If you don't, command line argument parsing and
>> reblocking strips it.
>>
>> What do I mean by reblocking? I mean this:
>>
>>   $ printf "one %s three %s\n" ${hello:-two four}
>>   one two three four
> 
> I don't see anything special about this; it's a straightforward
> consequence of word splitting.

Is that what the standard calls it? It's been years since I read through
the thing from start to finish, terminology gets a bit fuzzy.

> Because there are no quotes around
> ${hello...}, its value is subject to word splitting, so there are two
> arguments to printf.

Yes, I know why it does it.
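
For reference, here's a small sketch of the unquoted and quoted cases
side by side. It assumes the default IFS and that "hello" is unset; the
variable names (unquoted, quoted) are only for illustration, and the
brackets in the final printf are there to make trailing whitespace
visible.

```shell
unset hello   # make sure the :- default is used

# Unquoted: the default "two four" is split into two arguments.
unquoted=$(printf "one %s three %s\n" ${hello:-two four})

# Quoted: one argument, so the second %s expands to empty.
quoted=$(printf "one %s three %s\n" "${hello:-two four}")

printf '[%s]\n' "$unquoted" "$quoted"
```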

> Quotes change the behavior as you would expect;

You keep thinking I would expect things, but "$@"

> now there is one argument to printf:
> 
> printf "one %s three %s\n" "${hello:-two four}"
> one two four three
> 
> (with the last %s expanding to empty)
> 
>>> The bash aosabook chapter which I've referred to several times talks
>>> about how they had to duplicate a lot of the parser to handle this,
>>> and it still isn't right:
>>
>> I'm not looking at bash's implementation, I'm looking at the spec and
>> what it does when I feed it lots of test cases (what inputs produce what
>> outputs).
> 
> You apparently have a love-hate relationship with bash.

It's GNU code widely used by Linux. So yeah.

> You explicitly said you want to write bash and not just sh, yet you don't
> want to look at how it implements anything :)

I never look at FSF code. On general principles. But the behavior of the
standard Linux command line is what Linux developers (and the build
systems they write) expect.

>> Years ago I was trying to get it to preserve NUL bytes in the output of
> 
>> Toybox doesn't use libc getopt(), we use lib/args.c (which does not use
>> libc getopt), so what you decide to do in your shell and what it makes
>> sense for toysh to do may not be related to each other here.
> 
> Sure, I'm just describing what it does.  I agree getopts is an awkward
> interface in sh, but if you want a POSIX shell, much less a bash
> clone, you need it.

Yeah but I might be able to use lib/args.c syntax instead of getopt
syntax, since my stuff is mostly a superset of their stuff. Haven't dug
into that todo item yet. Not hugely worried about it either way.

>> Keep in mind, over the years people have written a dozen different
>> shells. It's really not that big a deal, I just want to do it _right_ so
>> I'm trying to reserve a large block of time so that once I start I can
>> finish rather than getting repeatedly interrupted. And that means
>> knocking down a bunch of smaller todo items first.
> 
> I definitely agree that you want a big block of uninterrupted time.
> (I've been off work since March so I've got that going for me.)
> 
> It's not clear to me that any reasonably popular shell was started
> later than 1990 or so (is zsh the latest?).  I think the BSDs are
> using code started 40+ years ago.  I don't know when mksh is from, but
> I think it must be that old too.

This is why I want a bash replacement. Large existing userbase should be
able to move over as painlessly as possible. I'm not trying to invent
significant new syntax here.

A shell is fairly central to the idea of unix, and the default shell of
Linux has always been bash. (Ubuntu's insanity notwithstanding: the way
Ubuntu admitted its mistake was to make /bin/bash the default _login_
shell, so it was in all the /etc/passwd entries despite #!/bin/sh
pointing to something political and useless.)

> As I mentioned, my goal isn't to simply implement sh, because that's
> been done.  It seems to me that 25 years is a good interval to have
> some innovation in the shell.  I'm just starting with sh so it's a
> superset of what is known to work, and so people actually have a
> migration path.

The same way C is decades old therefore Objective C and C++ and so on
_must_ be an improvement?

I have seen lots, and lots, and LOTS of new languages fork off of
existing stuff over the years. Back when I was on fidonet in the 90's
somebody had collected a list of TWO THOUSAND programming languages,
which seemed kind of excessive. (I don't still have this list and it
would be 20+ years out of date anyway, but I remember there was more
than one language named "oberon".)

At $DAYJOB one of the programmers wrote an openoffice spreadsheet to
VHDL translation layer in something called leingen, which is a dialect
of scheme (which is a dialect of lithp) using java virtual machine
features. This did not seem advisable to me, and yet it exists and
nobody's had time to rewrite it yet.

Good luck with your project, it is a can of worms I have _zero_ interest in.

Rob