[Toybox] [PATCH] awk -- fixes and cleanups

Ray Gardner raygard at gmail.com
Fri Oct 18 16:08:29 PDT 2024


On Mon, Sep 9, 2024 at 8:49 AM Rob Landley <rob at landley.net> wrote:
> On 9/7/24 21:04, Ray Gardner wrote:
> > On Tue, Sep 3, 2024 at 11:43 PM Rob Landley <rob at landley.net> wrote:
[...]
> >> On 8/30/24 15:02, Ray Gardner wrote:
[...]

> >> Patch 4: How is the first change a good thing? (What's the benefit?) I mean
> >> "awk's version of $RANDOM isn't remotely, so let's weaken it further for
> >> compatibility...? In a way we can't obviously figure out how to test without, I
> >> dunno, container plumbing?
[ ... discussion of awk's rand(), srand() limitations ... ]

> > Is random() not good? Maybe not the best, but what is not remotely about it?
> > Period about 16*(2**31-1) not enough?

OK, it's not good enough. I don't know how bad random() is aside from
the period, but that is inadequate. I can put in a Bays-Durham shuffle
to fix that. I can also put in my own PRNG in place of random(), but it
seems you would prefer to keep toybox small.
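
For reference, a Bays-Durham shuffle keeps a small table of outputs from
the underlying generator and uses the previous output to pick which table
slot to return next. Below is a minimal sketch in awk, layered over rand().
The names bd_demo, bd_init, bd_rand, bd_tab, bd_n, and bd_y and the table
size of 32 are made up for illustration. This is not the toybox code; a
real change would presumably be C wrapped around random().

```sh
# Hypothetical demo: Bays-Durham shuffle layered over rand() in awk.
bd_demo() {
  awk '
    # Fill the shuffle table with raw outputs, then draw one more as state.
    function bd_init(n,    i) {
      bd_n = n
      for (i = 0; i < bd_n; i++) bd_tab[i] = rand()
      bd_y = rand()
    }
    # Use the previous output to pick a slot, return what was stored there,
    # and refill the slot with a fresh raw value.
    function bd_rand(    j) {
      j = int(bd_n * bd_y)      # rand() is in [0,1), so 0 <= j < bd_n
      bd_y = bd_tab[j]
      bd_tab[j] = rand()
      return bd_y
    }
    BEGIN {
      srand(1)                  # fixed seed so runs are repeatable
      bd_init(32)               # table size 32 is an arbitrary choice
      for (i = 0; i < 1000; i++) {
        x = bd_rand()
        if (x < 0 || x >= 1) bad++
      }
      print bad + 0             # count of out-of-range values
    }'
}
```

Calling bd_demo prints the number of shuffled values that fell outside
[0,1). It only checks that the wrapper stays in range; it says nothing
about the quality or period of the result.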

> Your patch went from initializing the random number generator from a fraction of
> a second to an even second (which is forever in computer time), and I don't
> know why you did that. What's the benefit?

> > The patch only addresses the seed when no arg is given to srand(). Yes it's
> > for compatibility,

> Compatibility with what?

I should have said compliance -- posix compliance. It's in the spec.
But also compatibility: gawk, nawk, mawk, goawk, busybox awk, all
support it.

In the POSIX spec, the "Future Directions" section says better seeding
options may appear in a future version. See the spec for details. I can
implement them now, but awk would get bigger.
[...]

> Do you have a test in the test suite that srand() is initialized to the current
> second?

> > BTW I have written a test (attached), based on the fact that srand(srand())
> > returns unix time in seconds.

> And this is important to you?

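This is the test that checks that srand() is initialized to the current
second. srand() returns the previous seed, so the outer call in
srand(srand()) returns the seed that the inner no-argument call just set.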
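
A small sketch of the idea behind that test, assuming an awk whose
srand() behaves as POSIX describes. The function names prev_seed and
seed_now are made up here, and this is not the attached test itself.

```sh
# srand(x) returns the seed that was in effect before the call.
prev_seed() { awk 'BEGIN { srand(42); print srand(7) }'; }

# With no argument, srand() seeds from the time of day, so the nested
# call should print the current time in seconds since the epoch.
seed_now() { awk 'BEGIN { print srand(srand()) }'; }

prev_seed    # prints 42
seed_now     # compare by eye with: date +%s
```

A real test has to allow for the clock ticking over between the awk run
and the date call.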

[...]

> >> Patch 3: You've made your input path pretty complicated to fix an issue that
> >> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
[...]

> >> Eh, applied anyway because it's local to a pending/ command but can't say I'm
> >> happy about it.

> > I'm not too happy with it either. I'd be happy with some help on the entire
> > input mechanism. But it's complicated.
[...]
> I've lost track of "record separator" vs "field separator" vs "line separator" here.

There is no line separator, only records and fields. RS is the record
separator and FS is the field separator. FS can be a regex. Regex RS
is "unspecified" in POSIX awk, but every awk supports regex RS.
The RS="" case is multiline records: runs of blank lines separate
records, with leading and trailing blank lines ignored.
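
To make the RS="" case concrete, here is a minimal sketch. The function
name para_records and the sample input are made up for illustration.

```sh
# Print record number, field count, and first field of each record
# when RS is empty (multiline records).
para_records() {
  awk 'BEGIN { RS = "" } { printf "%d:%d:%s\n", NR, NF, $1 }'
}

# Leading, trailing, and repeated blank lines do not create empty records.
printf '\n\nalpha beta\ngamma\n\n\n\ndelta\n\n' | para_records
```

This prints 1:3:alpha and then 2:1:delta. The first record spans two
lines, and with RS="" a newline also separates fields, so it has three
fields.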

The "interactive" awk input is supported by all awks I test (gawk,
nawk, mawk, goawk, busybox awk).

[...]
> > But now every awk allows RS to be a regex, which is "unspecified" by POSIX.

> Have you found anything using that yet? (Did it break a build script?)

I don't know what uses it, but every awk supports it. I am trying to
be posix compliant (modulo some locale stuff), but also handling
common extensions and non-posix behavior of other awks.

I've completely rewritten the wonky record-reading code to be much
cleaner. It was difficult. The regex RS is used to support the multiline
record feature. It is in the next batch of patches.

[...]
> > Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
> > gawk (5.3.0 in my case) pass all the tests.

I'm testing against gawk 5.3.1 now, and it does NOT pass all the
tests any longer. Not sure why it's failing two of the new ones:
"split() utf8" and "split fields utf8". Both do essentially the same
as:

BEGIN { n = split("aβc", a, ""); printf "%d %d", n, length(a);
        for (e = 1; e <= n; e++) printf " %s %s", e, "(" a[e] ")"; print "" }

BEGIN { FS = ""; $0 = "aβc"; printf "%d", NF;
        for (e = 1; e <= NF; e++) printf " %s %s", e, "(" $e ")"; print "" }

And both of those work in gawk when run directly.

> When gnu adds a new creepy thing nobody else supports yet, I generally wait for
> something to break due to its absence rather than jumping through hoops to chase
> their taillights. Right now, busybox awk is used to build rather a lot of
> packages, and if they haven't made the decision to implement this weirdness then
> packages are likely to continue to support NOT doing that...

I'm not doing anything busybox isn't already doing, except I handle utf-8
and have fewer bugs. The tests busybox can't pass are bugs. It has rather
a lot, I think.

Ray

