[Toybox] [PATCH] awk -- more patches

Thu Oct 31 16:35:03 PDT 2024

On Tue, Oct 29, 2024 at 1:15 PM Rob Landley <rob at landley.net> wrote:

> On 10/23/24 21:01, Ray Gardner wrote:

> Your awk is 4x the size of sed.c and almost as long as sh.c ...
> ... Heck, your awk is ~700 lines longer than busybox's.

Yes, bbox's awk is amazingly small. I think mine is smaller than all the
other awk implementations though.

>      // The addition of 'perturb' greatly improves the probe sequence.
> And going "is this strictly necessary for a small/simple implementation"?

That perturb thing we discussed here last November. It adds very little
to the "stride" computation but makes it way more efficient.

> A huge amount of effort went into this, and I respect that,

Hundreds of hours over two years actually.

> but I just want to build source packages, and at least half of this
> code seems to be optimization.

There's little or no optimization as far as I know.

> Are there a lot of performance-critical large awk scripts in the wild

This awk is not very high performance, I am actually disappointed and
want to make it faster.

> But I need to read what's already there to figure out what is and isn't
> necessary. You were talking about comments: it's not comments in the
> CODE, I've read past zlist, zstring, zvalue, and zmap and don't know
> what any of those things ARE.

Structs implementing expandible lists, ref-counted strings, any value
(number, ptr to string, ptr to array) and arrays ("maps"). See
https://www.raygard.net/awkdoc/ (please?).

> lbp_table exists but I dunno what an lbp is, and:

> //// syntax error diagnostic and recovery (Turner's method)

I could probably shave some code by quitting on first syntax error, as I
think some other awks do.

> ... compiler, not interpreter? Doesn't a compiler compile _to_
> something? (Are you compiling to bytecode internally...?)

Compiling to "zcode" -- like bytecode, but int-size instructions.

> keep hitting things like primary() without a clue what the function is

primary() parses num, string, regex, var, var with subscripts, and
function calls, along with their prefix and postfix operators, and
parenthesized subexpressions. The terminology is fairly standard, e.g.
see https://www.cs.unc.edu/~plaisted/comp455/Algol60.pdf near bottom of
first page. lbp is "left binding power" from Pratt parsing, e.g. see
https://www.engr.mun.ca/~theo/Misc/pratt_parsing.htm.
It's similar to the precedence levels used in expr.c and sh.c but Pratt
is a little more flexible.

> #define CALLED_BY_PRINT 99987 // Arbitrary, different from any real rbp

Print statements in awk are a hassle; this is part of dealing with them.
I think there's a bit about it in my raygard.net/awkdoc writeup.

> But I have yet to even begin to figure out what subset of this it
> actually NEEDS to do be doing for... what counts as real world use cases
> for awk? (Yes, I'm gonna have to sit down and learn awk at some point.
> Today is not that day.)

Do you want a subset of awk? How do you know which parts? I attempt to
cover nearly all of posix awk plus common extensions found in most
awks.

> I haven't properly triaged this awk.c because it's like 12th on my todo
> list and the top half-dozen things are actively on fire, plus it's still
> moving.

It will keep moving as I fix bugs and improve it. Unix awk is nearly 40
years old and Kernighan patched it over 200 times until just last year
when he handed it off. All the awks continue to get fixes.

> Sadly, sed is legitimately that large (93 lines of help output
> implementing over two dozen commands letters). I was originally
> expecting awk to have the ballpark complexity of sed, but apparently not?

You've said (9 May 2016) "3K is a _lot_. ... If awk's much bigger
than sed, I'm probably doing it wrong." And (9 Aug 2021) "And the last
version of busybox I released (1.2.1, circa 2007) had a ~2800 line awk
implementation, so it shouldn't be too hard to get something feasible
in toybox (probably under 1000 lines? Dunno.)"

You implied you can make an awk much smaller than busybox awk. That
underestimates what's involved.  When I print the posix specs
(https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sed.html)
(https://pubs.opengroup.org/onlinepubs/9799919799/utilities/awk.html)
sed is about 7.5 pages up to "Application Usage", 12 total;
awk is about 28.5 pages up to "Application Usage", 38.5 total.
I think it's reasonable to expect awk to be about 4x the size of sed.
It's an entire programming language, requiring a tokenizer, parser,
and interpreter.

I know you want little perfect bonsai; not always possible. awk will be
big, bc will be big, yacc will be big if you get one. (Still want one?)

> Rob

Ray