[Toybox] [PATCH] awk -- fixes and cleanups

Rob Landley rob at landley.net
Fri Oct 25 09:44:34 PDT 2024


I was writing this when I got distracted by your new patches, I should 
finish up and send it...

On 10/18/24 18:08, Ray Gardner wrote:
>>> Is random() not good? Maybe not the best, but what is not remotely about it?
>>> Period about 16*(2**31-1) not enough?
> 
> OK, it's not good enough. I don't know how bad random() is aside from
> the period, but that is inadequate. I can put in a Bays-Durham shuffle
> to fix that. I can also put in my own PRNG in place of random(), but it
> seems you would prefer to keep toybox small.

In general if you want real randomness, I ask the kernel for random 
bits. If you want repeatable prng streams where I can specify a seed to 
get reproducible behavior, randomly changing the PRNG out from under it 
is a bad thing.

You seem to be talking about something in between...?

>> Your patch went from initializing the random number generator from a fraction of
>> a second to an even second (which is forever in computer timme), and I don't
>> know why you did that. What's the benefit?
> 
>>> The patch only addresses the seed when no arg is given to srand(). Yes it's
>>> for compatibility,
> 
>> Compatibility with what?
> 
> I should have said compliance -- posix compliance. It's in the spec.
> But also compatibility: gawk, nawk, mawk, goawk, busybox awk, all
> support it.

To what end? Which of the above two use cases is being supported here? 
Actual randomness (which the kernel is better at than userspace), or 
reproducible PRNG sequences from a known seed?

> This is the test that tests that srand() is inited to the current
> second. srand() returns the previous seed.

And getrandom() doesn't need to be seeded.

(Well, ok, it does but that's not an individual userspace command's 
problem. That's an init script and hardware config question.)

> [...]
> 
>>>> Patch 3: You've made your input path pretty complicated to fix an issue that
>>>> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
> [...]
> 
>>>> Eh, applied anyway because it's local to a pending/ command but can't say I'm
>>>> happy about it.
> 
>>> I'm not too happy with it either. I'd be happy with some help on the entire
>>> input mechanism. But it's complicated.
> [...]
>> I've lost track of "record separator" vs "field separator" vs "line separator" here.
> 
> There is no line separator.

So newline is hardwired there?

> Records and fields. RS is record
> separator, FS is field separator. FS can be a regex. Regex RS
> is "unspecified" in posix awk, but every awk supports regex RS.
> The RS="" case is multiline records. Runs of blank lines separate
> records, with leading and trailing blank lines ignored.
> 
> The "interactive" awk input is supported by all awks I test (gawk,
> nawk, mawk, goawk, busybox awk).
> 
> [...]
>>> But now every awk allows RS to be a regex, which is "unspecified" by POSIX.
> 
>> Have you found anything using that yet? (Did it break a build script?)
> 
> I don't know what uses it, but every awk supports it.

I'm surprised the "one true awk" that Android's been building supports a 
feature that DIDN'T make it into Posix 2024...

When did busybox add this? Hmmm... (Oh the test cases you'll need: 
busybox commits 5323af7f5180 af1bd0962527 patched up this logic because 
it's corner case city...) Looks like it was there when Glenn copied this 
file into a different directory in 2002 (commit 545106f8db3e) so 
presumably they copied it from gnu/dammit in the first place. (For all I 
know their code _was_ from some ancient version of gnu awk, since it was 
a GPLv2 project at the time...)

> I am trying to
> be posix compliant (modulo some locale stuff), but also handling
> common extensions and non-posix behavior of other awks.
> 
> I've completely re-written the wonky record-reading code to be much
> cleaner. Was difficult. The regex RS is used to support the multiline
> record feature. Is in the next batch of patches.

I deleted my local cleanups to apply your patches. (It's man.c all over 
again...)

> [...]
>>> Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
>>> gawk (5.3.0 in my case) pass all the tests.
 >
> I'm testing against gawk 5.3.1 now, and it does NOT pass all the
> tests any longer. Not sure why it's failing two of the new ones:
> "split() utf8" and "split fields utf8". Both do essentially the same
> as:
> 
> BEGIN{n = split("aβc", a, ""); printf "%d %d", n, length(a);for (e =
> 1; e <= n; e++) printf " %s %s", e, "(" a[e] ")";print ""}
> 
> BEGIN{FS=""; $0 = "aβc"; printf "%d", NF; for (e = 1; e <= NF; e++)
> printf " %s %s", e, "(" $e ")"; print ""}
> 
> And both of those work in gawk when run directly.

Part of the problem writing tests is figuring out how much to test. I 
keep having test skew running bash through "TEST_HOST=1 make test_sh" 
although part of the reason is I keep emailing chet and he FIXES stuff. 
But if there's no standard and no stable reference implementation the 
test might not mean anything.

My preferred test case is running as many package builds through it as 
possible, since that's the real world data. Over in busybox-land alpine 
linux and friends did that, but toybox wasn't ready when that ecosystem 
launched and it's not designed to allow drop-in replacement.

I've been vaguely poking at 
https://wiki.debian.org/HelmutGrohne/rebootstrap and 
https://forums.debian.net/viewtopic.php?t=66936 and so on, but the 
debian repo isn't exactly designed to let you do this either. (Erik 
Andersen did it for busybox+uclibc many moons ago, but that's lost to 
history and he never went that far...)

My todo.txt says I should watch:

watch (from https://peertube.debian.social/videos/local?s=1)
   Automating architecture bootstrap:
     https://peertube.debian.social/w/45XroN9CnbYLNLKQH3GD9F
   Building a busybox based debian
     https://peertube.debian.social/w/chzkKrMvEczG7qQyjbMKPr
   Android tools BOF
     https://peertube.debian.social/w/n57xBrJnLZQS1L112cV6wi
   spontaneous cross compiling BOF
     https://peertube.debian.social/w/3kJncnbRoXaCS1MSQqRixT
   Running debian desktop in containers
     https://peertube.debian.social/w/gpVkreH3CUXGfLd45c7ULt

And I did watch the "building a busybox based debian" one but his answer 
was basically "smash the repo dependencies, just tell it to extract 
these packages into an empty directory with nothing satisfied and then 
manually fix it up with a text editor and some symlinks"...

(I emailed him before the move and he said he had plans to fix the 
download links as part of a software upgrade of the site...?)

>> When gnu adds a new creepy thing nobody else supports yet, I generally wait for
>> something to break due to its absence rather than jumping through hoops to chase
>> their taillights. Right now, busybox awk is used to build rather a lot of
>> packages, and if they haven't made the decision to implement this weirdness than
>> packages are likely to continue to support NOT doing that...
> 
> I'm not doing anything busybox isn't already doing, except I handle utf-8
> and have fewer bugs. The tests busybox can't pass are bugs. It has rather
> a lot, I think.

Because as far as I can tell busybox's version might be based off a fork 
of gnu awk from the 1990's? It's hard to tell _where_ it came from 
(other than the comment at the top of the file which doesn't say how 
much was original code and how much was repurposed from elsewhere) 
because the conversion from svn to git seems to have broken this history 
at commit 545106f8db3e and the old svn is long gone. I could dig through 
old tarballs, and did for https://busybox.net/~landley/forensics.txt 
which doesn't mention awk so it probably wasn't there in 0.25...

> Ray

Rob


More information about the Toybox mailing list