[Toybox] [PATCH] awk -- fixes and cleanups

Sat Sep 7 19:04:31 PDT 2024

On Tue, Sep 3, 2024 at 11:43 PM Rob Landley <rob at landley.net> wrote:
>
> Sorry, been sick in my Tokyo hotel bed since before the typhoon ended.

Hope you're doing better by now.

> On 8/30/24 15:02, Ray Gardner wrote:
> Patch 1: ok.

> Patch 2: eh, not ideal, but you want an exit() with return code supplied asn
> argument and lib/ doesn't do that, so... eh.

What would make it more ideal?

> Patch 3: You've made your input path pretty complicated to fix an issue that
> seems to boil down to "gracefully handle short reads".

Discussed below because a bit long...

> Patch 4: How is the first change a good thing? (What's the benefit?) I mean
> "awk's version of $RANDOM isn't remotely, so let's weaken it further for
> compatibility...? In a way we can't obviously figure out how to test without, I
> dunno, container plumbing?

awk's version of $RANDOM isn't remotely what? The spec just says awk rand()
returns a non-negative pseudo-random float less than 1; it could use any
PRNG.  I am using POSIX random() twice for each number awk rand() generated.
Is random() not good? Maybe not the best, but what is not remotely about it?
Period about 16*(2**31-1) not enough?

The patch only addresses the seed when no arg is given to srand(). Yes it's
for compatibility, but I'm not sure it's weaker, unless your awk code is
calling srand() more than once a second. The weakness IMO is where the spec
says the seed is 0 to 2**31-1 or behavior is unspecified.

But would you prefer toybox awk to not meet posix spec and not work like
gawk or Kernighan awk?

BTW I have written a test (attached), based on the fact that srand(srand())
returns unix time in seconds.

> Patch 5: is this one testable? You had tests before, but not here.

Tests attached. BTW thanks for asking for them. They turned up a bug in the
last patch. Fixed in first attached patch.
>
> Patch 6: sure.
>
> Patch 7: again, test would be nice.

Tests attached.

> Patch 3: You've made your input path pretty complicated to fix an issue that
> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
> (posix-2008) somehow doesn't apply here (I am way too sick right now to try to
> brain through what the previous code you patched was doing) can you check each
> short read to see if you got whatever the delimiter is currently set to?
>
> I mean, this is what all those arguments about stdin buffer size with Elliott
> have been about... so you added an is_tty flag. My toysh plumbing is pulling
> some of those shenanigans but that's because I have to prompt the user and write
> my own interactive editing and command history plumbing that gnu/dammit made a
> whole separate library for. (And even then I wanted to make that plumbing
> generic and shared between commands, until the new vi implementation walled
> itself off from ever participating in that. I might still write an editor that
> behaves like nano and joe and microemacs to go alongside the bespoke vi
> implementation that will never share any of that code, but that could also be my
> cold talking. It's kinda needed for watch too. And less...)
>
> Eh, applied anyway because it's local to a pending/ command but can't say I'm
> happy about it.

I'm not too happy with it either. I'd be happy with some help on the entire
input mechanism. But it's complicated. Not sure if this is all stuff you
know, but anyway: POSIX awk can use any single character as a record
separator (RS). If RS=="" then it's a special multiline record mode; records
are separated by one or more blank (empty) lines, and fields are separated
in the usual fashion, except if FS is exactly one character then newlines
are also field separators. (POSIX contradicts gawk and bwk awk here; patch 5
above addresses this.)

So I originally had code for "normal" RS usage and some other code for
"multiline record" mode, and both used getdelim() to read up to the RS.

But now every awk allows RS to be a regex, which is "unspecified" by POSIX.
So getdelim() can't deal. I found RS-as-regex to be tricky. You have to read
well past where the RS matches. See
 https://www.raygard.net/awkdoc/pages/awk_regex_record_separator.html if you
want to know why. Maybe there's a better way; I'm open to ideas.

(I kept the getdelim() code in the separate routine for multiline records; I
think there should be a way to merge it all together.)

But now I have to go for long reads if RS is a regex, but stop reading at a
newline if RS is a newline and the input is a tty, or the interactive
behavior is wrong.

And on interactive usage I also need not-full buffering (no- or
line-buffering) or it doesn't work right on output. But having line
buffering on for all other usage is a performance hit. But if there were a
way to specify _default_ buffering, POSIX says it will use full buffering
for non-interactive use and non-full buffering for interactive. So what
would be best for awk AFAIK is to have an option to not do the setvbuf() at
all if (e.g.) TOYFLAG_DEFAULTBUF us set, (I don't expect this to happen.)

Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
gawk (5.3.0 in my case) pass all the tests.

Ray
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Update-awk.test.patch
Type: application/x-patch
Size: 3544 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20240907/d1eb6376/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Fix-field-split-bug-when-RS-and-FS-is-one-char.patch
Type: application/x-patch
Size: 1109 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20240907/d1eb6376/attachment-0001.bin>