[Toybox] [PATCH] awk -- fixes and cleanups
Rob Landley
rob at landley.net
Fri Oct 25 09:44:34 PDT 2024
I was writing this when I got distracted by your new patches, I should
finish up and send it...
On 10/18/24 18:08, Ray Gardner wrote:
>>> Is random() not good? Maybe not the best, but what is not remotely about it?
>>> Period about 16*(2**31-1) not enough?
>
> OK, it's not good enough. I don't know how bad random() is aside from
> the period, but that is inadequate. I can put in a Bays-Durham shuffle
> to fix that. I can also put in my own PRNG in place of random(), but it
> seems you would prefer to keep toybox small.
In general if you want real randomness, I ask the kernel for random
bits. If you want repeatable prng streams where I can specify a seed to
get reproducible behavior, randomly changing the PRNG out from under it
is a bad thing.
You seem to be talking about something in between...?
>> Your patch went from initializing the random number generator from a fraction of
>> a second to an even second (which is forever in computer timme), and I don't
>> know why you did that. What's the benefit?
>
>>> The patch only addresses the seed when no arg is given to srand(). Yes it's
>>> for compatibility,
>
>> Compatibility with what?
>
> I should have said compliance -- posix compliance. It's in the spec.
> But also compatibility: gawk, nawk, mawk, goawk, busybox awk, all
> support it.
To what end? Which of the above two use cases is being supported here?
Actual randomness (which the kernel is better at than userspace), or
reproducible PRNG sequences from a known seed?
> This is the test that tests that srand() is inited to the current
> second. srand() returns the previous seed.
And getrandom() doesn't need to be seeded.
(Well, ok, it does but that's not an individual userspace command's
problem. That's an init script and hardware config question.)
> [...]
>
>>>> Patch 3: You've made your input path pretty complicated to fix an issue that
>>>> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
> [...]
>
>>>> Eh, applied anyway because it's local to a pending/ command but can't say I'm
>>>> happy about it.
>
>>> I'm not too happy with it either. I'd be happy with some help on the entire
>>> input mechanism. But it's complicated.
> [...]
>> I've lost track of "record separator" vs "field separator" vs "line separator" here.
>
> There is no line separator.
So newline is hardwired there?
> Records and fields. RS is record
> separator, FS is field separator. FS can be a regex. Regex RS
> is "unspecified" in posix awk, but every awk supports regex RS.
> The RS="" case is multiline records. Runs of blank lines separate
> records, with leading and trailing blank lines ignored.
>
> The "interactive" awk input is supported by all awks I test (gawk,
> nawk, mawk, goawk, busybox awk).
>
> [...]
>>> But now every awk allows RS to be a regex, which is "unspecified" by POSIX.
>
>> Have you found anything using that yet? (Did it break a build script?)
>
> I don't know what uses it, but every awk supports it.
I'm surprised the "one true awk" that Android's been building supports a
feature that DIDN'T make it into Posix 2024...
When did busybox add this? Hmmm... (Oh the test cases you'll need:
busybox commits 5323af7f5180 af1bd0962527 patched up this logic because
it's corner case city...) Looks like it was there when Glenn copied this
file into a different directory in 2002 (commit 545106f8db3e) so
presumably they copied it from gnu/dammit in the first place. (For all I
know their code _was_ from some ancient version of gnu awk, since it was
a GPLv2 project at the time...)
> I am trying to
> be posix compliant (modulo some locale stuff), but also handling
> common extensions and non-posix behavior of other awks.
>
> I've completely re-written the wonky record-reading code to be much
> cleaner. Was difficult. The regex RS is used to support the multiline
> record feature. Is in the next batch of patches.
I deleted my local cleanups to apply your patches. (It's man.c all over
again...)
> [...]
>>> Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
>>> gawk (5.3.0 in my case) pass all the tests.
>
> I'm testing against gawk 5.3.1 now, and it does NOT pass all the
> tests any longer. Not sure why it's failing two of the new ones:
> "split() utf8" and "split fields utf8". Both do essentially the same
> as:
>
> BEGIN{n = split("aβc", a, ""); printf "%d %d", n, length(a);for (e =
> 1; e <= n; e++) printf " %s %s", e, "(" a[e] ")";print ""}
>
> BEGIN{FS=""; $0 = "aβc"; printf "%d", NF; for (e = 1; e <= NF; e++)
> printf " %s %s", e, "(" $e ")"; print ""}
>
> And both of those work in gawk when run directly.
Part of the problem writing tests is figuring out how much to test. I
keep having test skew running bash through "TEST_HOST=1 make test_sh"
although part of the reason is I keep emailing chet and he FIXES stuff.
But if there's no standard and no stable reference implementation the
test might not mean anything.
My preferred test case is running as many package builds through it as
possible, since that's the real world data. Over in busybox-land alpine
linux and friends did that, but toybox wasn't ready when that ecosystem
launched and it's not designed to allow drop-in replacement.
I've been vaguely poking at
https://wiki.debian.org/HelmutGrohne/rebootstrap and
https://forums.debian.net/viewtopic.php?t=66936 and so on, but the
debian repo isn't exactly designed to let you do this either. (Erik
Andersen did it for busybox+uclibc many moons ago, but that's lost to
history and he never went that far...)
My todo.txt says I should watch:
watch (from https://peertube.debian.social/videos/local?s=1)
Automating architecture bootstrap:
https://peertube.debian.social/w/45XroN9CnbYLNLKQH3GD9F
Building a busybox based debian
https://peertube.debian.social/w/chzkKrMvEczG7qQyjbMKPr
Android tools BOF
https://peertube.debian.social/w/n57xBrJnLZQS1L112cV6wi
spontaneous cross compiling BOF
https://peertube.debian.social/w/3kJncnbRoXaCS1MSQqRixT
Running debian desktop in containers
https://peertube.debian.social/w/gpVkreH3CUXGfLd45c7ULt
And I did watch the "building a busybox based debian" one but his answer
was basically "smash the repo dependencies, just tell it to extract
these packages into an empty directory with nothing satisfied and then
manually fix it up with a text editor and some symlinks"...
(I emailed him before the move and he said he had plans to fix the
download links as part of a software upgrade of the site...?)
>> When gnu adds a new creepy thing nobody else supports yet, I generally wait for
>> something to break due to its absence rather than jumping through hoops to chase
>> their taillights. Right now, busybox awk is used to build rather a lot of
>> packages, and if they haven't made the decision to implement this weirdness than
>> packages are likely to continue to support NOT doing that...
>
> I'm not doing anything busybox isn't already doing, except I handle utf-8
> and have fewer bugs. The tests busybox can't pass are bugs. It has rather
> a lot, I think.
Because as far as I can tell busybox's version might be based off a fork
of gnu awk from the 1990's? It's hard to tell _where_ it came from
(other than the comment at the top of the file which doesn't say how
much was original code and how much was repurposed from elsewhere)
because the conversion from svn to git seems to have broken this history
at commit 545106f8db3e and the old svn is long gone. I could dig through
old tarballs, and did for https://busybox.net/~landley/forensics.txt
which doesn't mention awk so it probably wasn't there in 0.25...
> Ray
Rob
More information about the Toybox
mailing list