[Toybox] [PATCH] awk -- fixes and cleanups

Rob Landley rob at landley.net
Mon Sep 9 08:04:09 PDT 2024


On 9/7/24 21:04, Ray Gardner wrote:
> On Tue, Sep 3, 2024 at 11:43 PM Rob Landley <rob at landley.net> wrote:
>>
>> Sorry, been sick in my Tokyo hotel bed since before the typhoon ended.
> 
> Hope you're doing better by now.

Better is a relative term, but yes. Thanks.

>> On 8/30/24 15:02, Ray Gardner wrote:
>> Patch 1: ok.
> 
>> Patch 2: eh, not ideal, but you want an exit() with return code supplied asn
>> argument and lib/ doesn't do that, so... eh.
> 
> What would make it more ideal?

It's a design issue: xexit() doesn't take an argument, but instead uses
toys.exitval which lets previous failures persist (ala perror_msg() and friends
in cp and tar and so on, error happened earlier, eventual return is nonzero
despite later success; common pattern in command line utilities).

Possibly there should be one in lib that takes a value, but how should it handle
a zero argument? (Does that mean "exit with previous error" if there was one or
exit with no error? And then there's the toys.rebound behavior where it has to
read the value out of toys to see what it was because longjmp has to provide a
nonzero value to say it's not the initial call to setjmp...)

Hence set the variable, then do the exit, being two operations...

>> Patch 3: You've made your input path pretty complicated to fix an issue that
>> seems to boil down to "gracefully handle short reads".
> 
> Discussed below because a bit long...
> 
>> Patch 4: How is the first change a good thing? (What's the benefit?) I mean
>> "awk's version of $RANDOM isn't remotely, so let's weaken it further for
>> compatibility...? In a way we can't obviously figure out how to test without, I
>> dunno, container plumbing?
> 
> awk's version of $RANDOM isn't remotely what?

Random.

> The spec just says awk rand()
> returns a non-negative pseudo-random float less than 1; it could use any
> PRNG.  I am using POSIX random() twice for each number awk rand() generated.
> Is random() not good? Maybe not the best, but what is not remotely about it?
> Period about 16*(2**31-1) not enough?

You switched from setting it to "fraction of a second" to setting it to "even
second" which means unshare CLONE_NEWTIME can fairly reliably deterministically
set the behavior. Heck, you don't even have to do that, you just experiment with
upcoming time values and then trigger it in a known second.

Your patch went from initializing the random number generator from a fraction of
a second to an even second (which is forever in computer timme), and I don't
know why you did that. What's the benefit?

> The patch only addresses the seed when no arg is given to srand(). Yes it's
> for compatibility,

Compatibility with what?

> but I'm not sure it's weaker, unless your awk code is
> calling srand() more than once a second.

If I fiddle around and find behavior I want on january 3 of next year at 3.17 pm
and 27 seconds, I can get that behavior.

Maybe this isn't a problem. I dunno the awk use cases. But I'm unclear on how
it's an _improvement_.

> The weakness IMO is where the spec
> says the seed is 0 to 2**31-1 or behavior is unspecified.

Only 4 billion possibilities. I can break into the system if I can guess it's
dynamic IPv4 within the next 30 seconds.

Not saying it's rock solid, I'm just... why? You don't even have to mask off the
bottom bits because srand() takes an unsigned as its argument, the conversion is
defined to do that for you...

> But would you prefer toybox awk to not meet posix spec and not work like
> gawk or Kernighan awk?

Do you have a test in the test suite that srand() is initialized to the current
second?

> BTW I have written a test (attached), based on the fact that srand(srand())
> returns unix time in seconds.

And this is important to you?

>> Patch 5: is this one testable? You had tests before, but not here.
> 
> Tests attached. BTW thanks for asking for them. They turned up a bug in the
> last patch. Fixed in first attached patch.
>>
>> Patch 6: sure.
>>
>> Patch 7: again, test would be nice.
> 
> Tests attached.
> 
>> Patch 3: You've made your input path pretty complicated to fix an issue that
>> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
>> (posix-2008) somehow doesn't apply here (I am way too sick right now to try to
>> brain through what the previous code you patched was doing) can you check each
>> short read to see if you got whatever the delimiter is currently set to?
>>
>> I mean, this is what all those arguments about stdin buffer size with Elliott
>> have been about... so you added an is_tty flag. My toysh plumbing is pulling
>> some of those shenanigans but that's because I have to prompt the user and write
>> my own interactive editing and command history plumbing that gnu/dammit made a
>> whole separate library for. (And even then I wanted to make that plumbing
>> generic and shared between commands, until the new vi implementation walled
>> itself off from ever participating in that. I might still write an editor that
>> behaves like nano and joe and microemacs to go alongside the bespoke vi
>> implementation that will never share any of that code, but that could also be my
>> cold talking. It's kinda needed for watch too. And less...)
>>
>> Eh, applied anyway because it's local to a pending/ command but can't say I'm
>> happy about it.
> 
> I'm not too happy with it either. I'd be happy with some help on the entire
> input mechanism. But it's complicated. Not sure if this is all stuff you
> know,

What awk does under the covers? No, it is not stuff I know. If I want to match a
regex and then do a thing on it, I generally use sed, grep, bash, C, python, java...

> but anyway: POSIX awk can use any single character as a record
> separator (RS). If RS=="" then it's a special multiline record mode; records
> are separated by one or more blank (empty) lines, and fields are separated
> in the usual fashion, except if FS is exactly one character then newlines
> are also field separators. (POSIX contradicts gawk and bwk awk here; patch 5
> above addresses this.)

I've lost track of "record separator" vs "field separator" vs "line separator" here.

> So I originally had code for "normal" RS usage and some other code for
> "multiline record" mode, and both used getdelim() to read up to the RS.
> 
> But now every awk allows RS to be a regex, which is "unspecified" by POSIX.

Have you found anything using that yet? (Did it break a build script?)

> So getdelim() can't deal. I found RS-as-regex to be tricky. You have to read
> well past where the RS matches. See
>  https://www.raygard.net/awkdoc/pages/awk_regex_record_separator.html if you
> want to know why. Maybe there's a better way; I'm open to ideas.

You only have to do that when the record separator IS a regex.

Toybox grep has some truly disgusting "parse_regex()" plumbing that does a
bucket sort on multiple patterns because the android guys needed a fast for a
build script doing something eldrich and squicky.

If this is legitimately a second use for that kind of horrible nonsense (I hope
not), it could be cleaned up and repotted into lib. I note that if I get around
to writing a lex or m4 implementation, something like that may wind up
metastasizing into lib anyway.

> (I kept the getdelim() code in the separate routine for multiline records; I
> think there should be a way to merge it all together.)
> 
> But now I have to go for long reads if RS is a regex, but stop reading at a
> newline if RS is a newline and the input is a tty, or the interactive
> behavior is wrong.

Is there a way to ONLY penalize the regex case? I don't mind "you shouldn't be
doing this at all ever and it's not in posix" being a slow path, but I don't
have enough domain expertise to have a strong opinion here other than being
uncomfortable with "all input is currently very complicated".

> And on interactive usage I also need not-full buffering (no- or
> line-buffering) or it doesn't work right on output. But having line
> buffering on for all other usage is a performance hit. But if there were a
> way to specify _default_ buffering, POSIX says it will use full buffering
> for non-interactive use and non-full buffering for interactive. So what
> would be best for awk AFAIK is to have an option to not do the setvbuf() at
> all if (e.g.) TOYFLAG_DEFAULTBUF us set, (I don't expect this to happen.)

That's what I was initially doing, and Elliott didn't like it.

Having a DEFAULT flag that is not the default is just... ow?

> Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
> gawk (5.3.0 in my case) pass all the tests.

When gnu adds a new creepy thing nobody else supports yet, I generally wait for
something to break due to its absence rather than jumping through hoops to chase
their taillights. Right now, busybox awk is used to build rather a lot of
packages, and if they haven't made the decision to implement this weirdness than
packages are likely to continue to support NOT doing that...

> Ray

Rob


More information about the Toybox mailing list