[Toybox] [PATCH] awk -- more patches

Tue Oct 29 12:15:41 PDT 2024

On 10/23/24 21:01, Ray Gardner wrote:
>> Anyway, I'm not seeing that warning on android-ndk-r26d's llvm, do I
>> need a newer one...? (I can apply just the parentheses removal but I
>> didn't want to edit hunks out of your patch without asking.)
> 
> I know you don't like warnings about code that's not wrong to begin
> with (agreed). It's your project and if you want to not keep in sync
> with what I've done, I think I'll have to see what you do with my
> patches and then maybe get my code in sync with yours.
>
> ICYMI, FYI, and TL;DR: You may have noticed that toybox awk is
> derivative of a standalone version. The standalone is originally in
> several source files in a somewhat modular division and there's an awk
> script that (with the Makefile) combines and converts the multi-file
> version into a single "monolithic" file.

This is kind of what I'm wondering about: I had some pending cleanups I 
deleted for the most recent round of patches, but I don't want another 
"man.c" situation where I back _all_ the way off.

Your awk is 4x the size of sed.c and almost as long as sh.c (which is a 
dozen commands in a trenchcoat and I should probably figure out how to 
break it up with a lib/sh.c or similar). Heck, your awk is ~700 lines 
longer than busybox's.

I'm glancing past stuff like:

     // The addition of 'perturb' greatly improves the probe sequence.
     // See the Python dict implementation for more details.

And going "is this strictly necessary for a small/simple implementation"?
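
(For reference, the 'perturb' trick that comment points at is the probing scheme described in CPython's dict implementation: instead of stepping linearly after a collision, the upper bits of the hash get folded into the next slot index a few bits at a time. Below is a minimal sketch of the idea only, not the code in awk.c. The name probe_sequence is made up for illustration, and the shift of 5 is the value CPython uses; whether awk.c uses the same constants is an assumption.)

```c
#include <assert.h>
#include <stddef.h>

#define PERTURB_SHIFT 5  // value used by CPython; assumed here

// Hypothetical helper: record the first n slots probed for `hash` in an
// open-addressed table of `size` slots. size must be a power of two.
static void probe_sequence(size_t hash, size_t size, size_t *order, size_t n)
{
  size_t mask = size - 1, perturb = hash, i = hash & mask, k;

  assert(size && !(size & mask));
  for (k = 0; k < n; k++) {
    order[k] = i;
    // Fold the remaining high bits of the hash into the next index.
    perturb >>= PERTURB_SHIFT;
    // Once perturb reaches 0 this is i = 5*i + 1 (mod size), which
    // visits every slot of a power-of-two table.
    i = (i*5 + perturb + 1) & mask;
  }
}
```

The simple alternative is linear probing, i = (i + 1) & mask, which is one line but lets keys that land near each other pile up in runs. That tradeoff is the "is this strictly necessary" question.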

A huge amount of effort went into this, and I respect that, but I just 
want to build source packages, and at least half of this code seems to 
be optimization.

Are there a lot of performance-critical large awk scripts in the wild 
circa 2024? For a definition of performance-critical that survived 
Moore's Law? (Code written on a pentium 2 and now running on a 64 bit 
multi-gigahertz processor with several megabytes of L2 cache.)

But I need to read what's already there to figure out what is and isn't 
necessary. You were talking about comments: it's not comments in the 
CODE, I've read past zlist, zstring, zvalue, and zmap and don't know 
what any of those things ARE. lbp_table exists but I dunno what an lbp 
is, and:

//// syntax error diagnostic and recovery (Turner's method)
// D.A. Turner, Error diagnosis and recovery in one pass compilers,
// Information Processing Letters, Volume 6, Issue 4, 1977, PP 113-115

... compiler, not interpreter? Doesn't a compiler compile _to_ 
something? (Are you compiling to bytecode internally...?)

Reading through trying to focus on the stuff that (maybe?) isn't an 
optimized hash table implementation to manage ~100 pieces of data, I 
keep hitting things like primary() without a clue what the function is 
for or trying to do. I mean I can go check the callers, and users of 
stuff like:

#define CALLED_BY_PRINT 99987 // Arbitrary, different from any real rbp value

But I have yet to even begin to figure out what subset of this it 
actually NEEDS to be doing for... what counts as real world use cases 
for awk? (Yes, I'm gonna have to sit down and learn awk at some point. 
Today is not that day.)

> (BTW that monolithic file is in Marc Paquette's "Bestiary of
> Single-File Implementations of Programming Languages"
> (https://github.com/marcpaq/b1fipl), and you may be interested in
> seeing his "Ancestry of Unix shells"
> (https://github.com/marcpaq/shellancestry).)
> 
> Another script converts the monolithic version into the toybox
> version, mainly by dropping code in #ifndef TOYBOX...#endif sections.
> I try to keep it clean for you.

Thank you.

> But others have seen the standalone code. Gawk maintainer Arnold
> Robbins noticed it, and mentioned it to Nelson H.F. Beebe (U. of
> Utah math prof), who wrote me "I tried builds on a few systems at
> first, and reported my findings to Arnold. The result were positive,
> so I spent several more hours doing builds on a variety of Unix-family
> operating systems, covering all of the major CPU families used on
> desktops since the early 1990s."

Cool.

> Beebe is "probably one of the biggest users of awk in the world,
> having written hundreds of thousands of lines of code in that
> language." He did the first port of awk to PDP-10 in 1987.
> 
> He asked me to "fix" the clang warnings. I had not tried clang but
> I got it so I could test it out on the standalone versions.

I'm still using the NDK. I want to build llvm from source so (among 
other things) I can do a properly supported hexagon target, but I'm 
still trying to dig out from burnout and my stress levels are unlikely 
to go DOWN before the election...

> I can work on patching up my "source code build" so that the
> standalone monolithic and toybox versions are separated. I'd
> rather keep them in sync for now.
> 
> Higher priority for me for now is improving the random number
> generator for both (they're separate now b/c I didn't want toybox to
> have the extra code of the one I've got in the standalone, since
> you're already using random() for other stuff),

I haven't properly triaged this awk.c because it's like 12th on my todo 
list and the top half-dozen things are actively on fire, plus it's still 
moving.

> squashing bugs,
> improving portability of the standalone, improving performance (I have
> a guy asking if I can make it as fast as mawk -- as if). And doing
> more of Prof. Beebe's suggestions.
> 
> Also, I'd like to explore making it smaller. I have an idea that's
> (baked<<½) and may not go anywhere. But before getting into that, I'd
> like to get awk on the track toward the posix folder and being built
> default in toybox. What needs to happen? What can I do to help that?

$ wc -l toys/posix/*.c | head -n -1 | sort -n | tail -n 10
    409 toys/posix/sort.c
    485 toys/posix/patch.c
    544 toys/posix/cp.c
    547 toys/posix/file.c
    548 toys/posix/grep.c
    645 toys/posix/ls.c
    727 toys/posix/find.c
   1117 toys/posix/sed.c
   1218 toys/posix/tar.c
   2012 toys/posix/ps.c

I note that ps.c is 5 commands in one file (ps, top, iotop, pgrep, 
pkill) and my go-to example of something embarrassingly large I'd like to 
break up into smaller pieces. The lib/ps.c exports would be a pain 
though, quite a wide API. Really the problem is having the SLOT enum in 
lib/lib.h would be uncomfortable, but I dowanna add a lib/ps.h...

Alas tar.c is basically two commands in one (create and extract) that 
share data structures but not much code. It might wind up sharing code 
with other archivers if I do zip/mkisofs/mksquashfs so maybe some of it 
could move to lib/ someday, but that's tricksy design work, and what put 
it over 1000 lines was feature creep: the --tarxform support, multiple 
formats of "sparse" support, the deeply silly --to-command support 
people wanted, xattr support. So embarrassingly large, but I dunno how to 
fix it at the moment.

Sadly, sed is legitimately that large (93 lines of help output 
implementing over two dozen command letters). I was originally 
expecting awk to have the ballpark complexity of sed, but apparently not?

Rob