[Toybox] [PATCH] awk -- more patches

Tue Oct 29 17:38:53 PDT 2024

On Tue, Oct 29, 2024 at 3:15 PM Rob Landley <rob at landley.net> wrote:

> On 10/23/24 21:01, Ray Gardner wrote:
> >> Anyway, I'm not seeing that warning on android-ndk-r26d's llvm, do I
> >> need a newer one...? (I can apply just the parentheses removal but I
> >> didn't want to edit hunks out of your patch without asking.)
> >
> > I know you don't like warnings about code that's not wrong to begin
> > with (agreed). It's your project and if you want to not keep in sync
> > with what I've done, I think I'll have to see what you do with my
> > patches and then maybe get my code in sync with yours.
> >
> > ICYMI, FYI, and TL;DR: You may have noticed that toybox awk is
> > derivative of a standalone version. The standalone is originally in
> > several source files in a somewhat modular division and there's an awk
> > script that (with the Makefile) combines and converts the multi-file
> > version into a single "monolithic" file.
>
> This is kind of what I'm wondering about: I had some pending cleanups I
> deleted for the most recent round of patches, but I don't want another
> "man.c" situation where I back _all_ the way off.
>

i had a man.c rewrite large enough to realize what was wrong with the
rewrite, but never got round to starting again and doing it the right way.
but, yeah, the current one seems like an evolutionary dead end.

> Your awk is 4x the size of sed.c and almost as long as sh.c (which is a
> dozen commands in a trenchcoat and I should probably figure out how to
> break it up with a lib/sh.c or similar). Heck, your awk is ~700 lines
> longer than busybox's.
>
> I'm glancing past stuff like:
>
>      // The addition of 'perturb' greatly improves the probe sequence.
>      // See the Python dict implementation for more details.
>
> And going "is this strictly necessary for a small/simple implementation"?
>
> A huge amount of effort went into this, and I respect that, but I just
> want to build source packages, and at least half of this code seems to
> be optimization.
>
> Are there a lot of performance-critical large awk scripts in the wild
> circa 2024? For a definition of performance-critical that survived
> Moore's Law? (Code written on a pentium 2 and now running on a 64 bit
> multi-gigahertz processor with several megabytes of L2 cache.)
>
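
(for anyone else wondering about that "perturb" comment: the idea in the Python dict it points at is that an open-addressing hash table shouldn't just step to the next slot on a collision, but should mix the hash's high bits into where it looks next. a minimal sketch of that scheme in C, assuming a power-of-two table size. `probe_sequence` and its parameters are made up for illustration, and this is not the code in awk.c.)

```c
#include <stddef.h>
#include <stdint.h>

#define PERTURB_SHIFT 5  // the shift CPython's dict uses

// Hypothetical helper: store in out[0..n-1] the first n slots that would be
// probed for `hash` in a table of `cap` slots. cap must be a power of two.
void probe_sequence(uint32_t hash, size_t cap, size_t *out, size_t n)
{
  size_t mask = cap - 1, i = hash & mask, k;
  uint32_t perturb = hash;

  for (k = 0; k < n; k++) {
    out[k] = i;
    // Feed the not-yet-used high bits of the hash into the next index.
    perturb >>= PERTURB_SHIFT;
    // Once perturb reaches 0 this is i = 5*i + 1 (mod cap), which visits
    // every slot of a power-of-two table before repeating.
    i = (5*i + 1 + perturb) & mask;
  }
}
```

(with only 8 slots, hashes 0 and 32 both start at slot 0 but diverge on the second probe, where plain linear probing would keep them colliding.)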

i certainly had trouble when AOSP bust one of one-true-awk's scalability
limits. i suspect -- like with all sufficiently general-purpose tools --
this might be one of those MS Word cases where the answer is "any given
person only needs 1% of it, but it's a different 1% for different people".

but that seems like the philosophical issue the two of you are struggling
with, perhaps because you're at opposite ends of the spectrum... i think
ray wants there to be [pun very much intended] a "one true awk" so there's
never any excuse to have feature or performance differences, whereas you
kind of want the opposite --- a minimal awk for toybox, and "other options
are available" if you want something more.

(the really hard part is that getting from "one true bc" or "one true awk
[mk ii]" to "tiny bc"/"tiny awk" is a lot of work, so the most likely
outcome is getting stuck in the middle, with a lightly forked
slightly-smaller-but-not-much variant that's then harder to take upstream
fixes/additions from.)

> But I need to read what's already there to figure out what is and isn't
> necessary. You were talking about comments: it's not comments in the
> CODE, I've read past zlist, zstring, zvalue, and zmap and don't know
> what any of those things ARE. lbp_table exists but I dunno what an lbp
> is, and:
>
> //// syntax error diagnostic and recovery (Turner's method)
> // D.A. Turner, Error diagnosis and recovery in one pass compilers,
> // Information Processing Letters, Volume 6, Issue 4, 1977, PP 113-115
>
> ... compiler, not interpreter? Doesn't a compiler compile _to_
> something? (Are you compiling to bytecode internally...?)
>
> Reading through trying to focus on the stuff that (maybe?) isn't an
> optimized hash table implementation to manage ~100 pieces of data, I
> keep hitting things like primary() without a clue what the function is
> for or trying to do. I mean I can go check the callers, and users of
> stuff like:
>
> #define CALLED_BY_PRINT 99987 // Arbitrary, different from any real rbp value
>
> But I have yet to even begin to figure out what subset of this it
> actually NEEDS to be doing for... what counts as real world use cases
> for awk? (Yes, I'm gonna have to sit down and learn awk at some point.
> Today is not that day.)
>
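
(on `lbp`/`rbp`/`primary()`: those names are the usual vocabulary of Pratt-style "top down operator precedence" expression parsing, where lbp and rbp are left and right binding power, i.e. precedence. i'm assuming the awk code uses them in that conventional sense. a toy integer calculator showing the shape, with every name below invented for illustration rather than taken from awk.c:)

```c
#include <ctype.h>

static const char *src;  // cursor into the expression text

// Left binding power: how tightly an operator grabs the operand to its
// left. 0 means "not a binary operator", which ends the loop in expr().
static int lbp(int op)
{
  switch (op) {
  case '+': case '-': return 10;
  case '*': case '/': return 20;
  case '^': return 30;
  default: return 0;
  }
}

static long expr(int rbp);

static void skip_spaces(void)
{
  while (*src == ' ') src++;
}

// A primary is the smallest self-contained operand: here, a number or a
// parenthesized subexpression.
static long primary(void)
{
  long val = 0;

  skip_spaces();
  if (*src == '(') {
    src++;
    val = expr(0);
    skip_spaces();
    if (*src == ')') src++;
    return val;
  }
  while (isdigit((unsigned char)*src)) val = val*10 + (*src++ - '0');

  return val;
}

// Parse and evaluate operators that bind tighter than rbp, the right
// binding power of whatever operator (if any) sits to our left.
static long expr(int rbp)
{
  long left = primary(), right, p;

  for (;;) {
    skip_spaces();
    int op = *src, bp = lbp(op);

    if (bp <= rbp) return left;
    src++;
    // '^' is right associative: recursing with bp-1 lets another '^' to
    // the right bind first. The other operators are left associative.
    right = expr(op == '^' ? bp - 1 : bp);
    switch (op) {
    case '+': left += right; break;
    case '-': left -= right; break;
    case '*': left *= right; break;
    case '/': left = right ? left / right : 0; break;  // no error handling
    case '^':
      for (p = 1; right-- > 0;) p *= left;
      left = p;
      break;
    }
  }
}

long eval(const char *text)
{
  src = text;

  return expr(0);
}
```

(so `eval("1 + 2 * 3")` groups the multiplication first, and `eval("10 - 4 - 3")` groups left to right. a sentinel like CALLED_BY_PRINT would presumably be passed in the rbp position to mark a special caller, but that's a guess from the quoted comment.)
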
> > (BTW that monolithic file is in Marc Paquette's "Bestiary of
> > Single-File Implementations of Programming Languages"
> > (https://github.com/marcpaq/b1fipl), and you may be interested in
> > seeing his "Ancestry of Unix shells"
> > (https://github.com/marcpaq/shellancestry).)
> >
> > Another script converts the monolithic version into the toybox
> > version, mainly by dropping code in #ifndef TOYBOX...#endif sections.
> > I try to keep it clean for you.
>
> Thank you.
>
> > But others have seen the standalone code. Gawk maintainer Arnold
> > Robbins noticed it, and mentioned it to Nelson H.F. Beebe (U. of
> > Utah math prof), who wrote me "I tried builds on a few systems at
> > first, and reported my findings to Arnold. The results were positive,
> > so I spent several more hours doing builds on a variety of Unix-family
> > operating systems, covering all of the major CPU families used on
> > desktops since the early 1990s."
>
> Cool.
>
> > Beebe is "probably one of the biggest users of awk in the world,
> > having written hundreds of thousands of lines of code in that
> > language." He did the first port of awk to PDP-10 in 1987.
> >
> > He asked me to "fix" the clang warnings. I had not tried clang but
> > I got it so I could test it out on the standalone versions.
>
> I'm still using the NDK. I want to build llvm from source so (among
> other things) I can do a properly supported hexagon target, but I'm
> still trying to dig out from burnout and my stress levels are unlikely
> to go DOWN before the election...
>

since cc on macOS is clang, that's actually a pretty good proxy if that's
easier. on macOS i too see just the one warning from ToT toybox's awk:
```
/tmp/toybox$ make
scripts/make.sh

warning: using unfinished code from toys/pending
generated/{Config.in,newtoys.h,flags.h,help.h}
Compile toybox
...............................................................toys/pending/awk.c:994:28: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
  994 |     } else if ((TT.scs->ch == '(')) {
      |                 ~~~~~~~~~~~^~~~~~
toys/pending/awk.c:994:28: note: remove extraneous parentheses around the comparison to silence this warning
  994 |     } else if ((TT.scs->ch == '(')) {
      |                ~           ^     ~
toys/pending/awk.c:994:28: note: use '=' to turn this equality comparison into an assignment
  994 |     } else if ((TT.scs->ch == '(')) {
      |                            ^~
      |                            =
......1 warning generated.
...........................................................
/tmp/toybox$
```
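
(the reason clang cares: a doubled pair of parentheses is the conventional way to tell the compiler "yes, i really meant `=` here" in a condition, so `((a == b))` looks like a possible typo in the other direction. the fix is just dropping the extra pair. a sketch, with a hypothetical `struct scan_state` standing in for whatever `TT.scs` points at:)

```c
// Hypothetical stand-in for the scanner state; only the `ch` member
// appears in the warning above.
struct scan_state {
  int ch;
};

int is_open_paren(struct scan_state *scs)
{
  // Before: if ((scs->ch == '(')) ... triggers -Wparentheses-equality.
  // After: one pair of parentheses, same behavior, no warning.
  if (scs->ch == '(') return 1;

  return 0;
}
```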

> > I can work on patching up my "source code build" so that the
> > standalone monolithic and toybox versions are separated. I'd
> > rather keep them in sync for now.
> >
> > Higher priority for me for now is improving the random number
> > generator for both (they're separate now b/c I didn't want toybox to
> > have the extra code of the one I've got in the standalone, since
> > you're already using random() for other stuff),
>
> I haven't properly triaged this awk.c because it's like 12th on my todo
> list and the top half-dozen things are actively on fire, plus it's still
> moving.
>
> > squashing bugs,
> > improving portability of the standalone, improving performance (I have
> > a guy asking if I can make it as fast as mawk -- as if). And doing
> > more of Prof. Beebe's suggestions.
> >
> > Also, I'd like to explore making it smaller. I have an idea that's
> > (baked<<½) and may not go anywhere. But before getting into that, I'd
> > like to get awk on the track toward the posix folder and being built
> > default in toybox. What needs to happen? What can I do to help that?
>
> $ wc -l toys/posix/*.c | head -n -1 | sort -n | tail -n 10
>     409 toys/posix/sort.c
>     485 toys/posix/patch.c
>     544 toys/posix/cp.c
>     547 toys/posix/file.c
>     548 toys/posix/grep.c
>     645 toys/posix/ls.c
>     727 toys/posix/find.c
>    1117 toys/posix/sed.c
>    1218 toys/posix/tar.c
>    2012 toys/posix/ps.c
>
> I note that ps.c is 5 commands in one file (ps, top, iotop, pgrep,
> pkill) and my go-to example of something embarrassingly large I'd like to
> break up into smaller pieces. The lib/ps.c exports would be a pain
> though, quite a wide API. Really the problem is having the SLOT enum in
> lib/lib.h would be uncomfortable, but I dowanna add a lib/ps.h...
>
> Alas tar.c is basically two commands in one (create and extract) that
> share data structures but not much code. It might wind up sharing code
> with other archivers if I do zip/mkisofs/mksquashfs so maybe some of it
> could move to lib/ someday, but that's tricksy design work, and what put
> it over 1000 lines was feature creep: the --tarxform support, multiple
> formats of "sparse" support, the deeply silly --to-command support
> people wanted, xattr support. So embarrassingly large, but I dunno how to
> fix it at the moment.
>
> Sadly, sed is legitimately that large (93 lines of help output
> implementing over two dozen command letters). I was originally
> expecting awk to have the ballpark complexity of sed, but apparently not?
>

yeah, there's _way_ more language and functionality in awk than sed. i'd
have guessed it would end up at least 5x larger than sed, so your 4x sounds
pretty good to me :-)

(O.G. one-true-awk is 8k lines, fwiw.)

> Rob