[Toybox] [PATCH] awk -- fixes and cleanups

enh enh at google.com
Mon Sep 9 09:21:55 PDT 2024


On Mon, Sep 9, 2024 at 10:49 AM Rob Landley <rob at landley.net> wrote:
>
> On 9/7/24 21:04, Ray Gardner wrote:
> > On Tue, Sep 3, 2024 at 11:43 PM Rob Landley <rob at landley.net> wrote:
> >>
> >> Sorry, been sick in my Tokyo hotel bed since before the typhoon ended.
> >
> > Hope you're doing better by now.
>
> Better is a relative term, but yes. Thanks.
>
> >> On 8/30/24 15:02, Ray Gardner wrote:
> >> Patch 1: ok.
> >
> >> Patch 2: eh, not ideal, but you want an exit() with return code supplied asn
> >> argument and lib/ doesn't do that, so... eh.
> >
> > What would make it more ideal?
>
> It's a design issue: xexit() doesn't take an argument, but instead uses
> toys.exitval which lets previous failures persist (ala perror_msg() and friends
> in cp and tar and so on, error happened earlier, eventual return is nonzero
> despite later success; common pattern in command line utilities).
>
> Possibly there should be one in lib that takes a value, but how should it handle
> a zero argument? (Does that mean "exit with previous error" if there was one or
> exit with no error? And then there's the toys.rebound behavior where it has to
> read the value out of toys to see what it was because longjmp has to provide a
> nonzero value to say it's not the initial call to setjmp...)
>
> Hence set the variable, then do the exit, being two operations...
>
> >> Patch 3: You've made your input path pretty complicated to fix an issue that
> >> seems to boil down to "gracefully handle short reads".
> >
> > Discussed below because a bit long...
> >
> >> Patch 4: How is the first change a good thing? (What's the benefit?) I mean
> >> "awk's version of $RANDOM isn't remotely, so let's weaken it further for
> >> compatibility...? In a way we can't obviously figure out how to test without, I
> >> dunno, container plumbing?
> >
> > awk's version of $RANDOM isn't remotely what?
>
> Random.
>
> > The spec just says awk rand()
> > returns a non-negative pseudo-random float less than 1; it could use any
> > PRNG.  I am using POSIX random() twice for each number awk rand() generated.
> > Is random() not good? Maybe not the best, but what is not remotely about it?
> > Period about 16*(2**31-1) not enough?
>
> You switched from setting it to "fraction of a second" to setting it to "even
> second" which means unshare CLONE_NEWTIME can fairly reliably deterministically
> set the behavior. Heck, you don't even have to do that, you just experiment with
> upcoming time values and then trigger it in a known second.
>
> Your patch went from initializing the random number generator from a fraction of
> a second to an even second (which is forever in computer timme), and I don't
> know why you did that. What's the benefit?
>
> > The patch only addresses the seed when no arg is given to srand(). Yes it's
> > for compatibility,
>
> Compatibility with what?
>
> > but I'm not sure it's weaker, unless your awk code is
> > calling srand() more than once a second.
>
> If I fiddle around and find behavior I want on january 3 of next year at 3.17 pm
> and 27 seconds, I can get that behavior.
>
> Maybe this isn't a problem. I dunno the awk use cases. But I'm unclear on how
> it's an _improvement_.
>
> > The weakness IMO is where the spec
> > says the seed is 0 to 2**31-1 or behavior is unspecified.
>
> Only 4 billion possibilities. I can break into the system if I can guess it's
> dynamic IPv4 within the next 30 seconds.
>
> Not saying it's rock solid, I'm just... why? You don't even have to mask off the
> bottom bits because srand() takes an unsigned as its argument, the conversion is
> defined to do that for you...
>
> > But would you prefer toybox awk to not meet posix spec and not work like
> > gawk or Kernighan awk?
>
> Do you have a test in the test suite that srand() is initialized to the current
> second?
>
> > BTW I have written a test (attached), based on the fact that srand(srand())
> > returns unix time in seconds.
>
> And this is important to you?
>
> >> Patch 5: is this one testable? You had tests before, but not here.
> >
> > Tests attached. BTW thanks for asking for them. They turned up a bug in the
> > last patch. Fixed in first attached patch.
> >>
> >> Patch 6: sure.
> >>
> >> Patch 7: again, test would be nice.
> >
> > Tests attached.
> >
> >> Patch 3: You've made your input path pretty complicated to fix an issue that
> >> seems to boil down to "gracefully handle short reads". If getdelim() out of libc
> >> (posix-2008) somehow doesn't apply here (I am way too sick right now to try to
> >> brain through what the previous code you patched was doing) can you check each
> >> short read to see if you got whatever the delimiter is currently set to?
> >>
> >> I mean, this is what all those arguments about stdin buffer size with Elliott
> >> have been about... so you added an is_tty flag. My toysh plumbing is pulling
> >> some of those shenanigans but that's because I have to prompt the user and write
> >> my own interactive editing and command history plumbing that gnu/dammit made a
> >> whole separate library for. (And even then I wanted to make that plumbing
> >> generic and shared between commands, until the new vi implementation walled
> >> itself off from ever participating in that. I might still write an editor that
> >> behaves like nano and joe and microemacs to go alongside the bespoke vi
> >> implementation that will never share any of that code, but that could also be my
> >> cold talking. It's kinda needed for watch too. And less...)
> >>
> >> Eh, applied anyway because it's local to a pending/ command but can't say I'm
> >> happy about it.
> >
> > I'm not too happy with it either. I'd be happy with some help on the entire
> > input mechanism. But it's complicated. Not sure if this is all stuff you
> > know,
>
> What awk does under the covers? No, it is not stuff I know. If I want to match a
> regex and then do a thing on it, I generally use sed, grep, bash, C, python, java...
>
> > but anyway: POSIX awk can use any single character as a record
> > separator (RS). If RS=="" then it's a special multiline record mode; records
> > are separated by one or more blank (empty) lines, and fields are separated
> > in the usual fashion, except if FS is exactly one character then newlines
> > are also field separators. (POSIX contradicts gawk and bwk awk here; patch 5
> > above addresses this.)
>
> I've lost track of "record separator" vs "field separator" vs "line separator" here.
>
> > So I originally had code for "normal" RS usage and some other code for
> > "multiline record" mode, and both used getdelim() to read up to the RS.
> >
> > But now every awk allows RS to be a regex, which is "unspecified" by POSIX.
>
> Have you found anything using that yet? (Did it break a build script?)

yeah, that was my question too --- does anyone _care_ that existing
implementations are terrible?

i also wonder whether we can't just change that? looking at the last
change to one-true-awk, i suspect they might not be aware?

https://github.com/onetrueawk/awk/commit/c70b9fe8823e8f21628725e634d5db1a6e3948d0

  awk: Use random(3) instead of rand(3)
  While none of them is considered even near to cryptographic
  level, random(3) is a better random generator than rand(3).

  Use random(3) for awk as is done in other systems.

  Thanks to Chenguang Li for discussing this in the lists
  and submitting the patch upstream.

(although that "as is done in other systems" might also be an argument
in the opposite direction.)

> > So getdelim() can't deal. I found RS-as-regex to be tricky. You have to read
> > well past where the RS matches. See
> >  https://www.raygard.net/awkdoc/pages/awk_regex_record_separator.html if you
> > want to know why. Maybe there's a better way; I'm open to ideas.
>
> You only have to do that when the record separator IS a regex.
>
> Toybox grep has some truly disgusting "parse_regex()" plumbing that does a
> bucket sort on multiple patterns because the android guys needed a fast for a
> build script doing something eldrich and squicky.
>
> If this is legitimately a second use for that kind of horrible nonsense (I hope
> not), it could be cleaned up and repotted into lib. I note that if I get around
> to writing a lex or m4 implementation, something like that may wind up
> metastasizing into lib anyway.
>
> > (I kept the getdelim() code in the separate routine for multiline records; I
> > think there should be a way to merge it all together.)
> >
> > But now I have to go for long reads if RS is a regex, but stop reading at a
> > newline if RS is a newline and the input is a tty, or the interactive
> > behavior is wrong.
>
> Is there a way to ONLY penalize the regex case? I don't mind "you shouldn't be
> doing this at all ever and it's not in posix" being a slow path, but I don't
> have enough domain expertise to have a strong opinion here other than being
> uncomfortable with "all input is currently very complicated".
>
> > And on interactive usage I also need not-full buffering (no- or
> > line-buffering) or it doesn't work right on output. But having line
> > buffering on for all other usage is a performance hit. But if there were a
> > way to specify _default_ buffering, POSIX says it will use full buffering
> > for non-interactive use and non-full buffering for interactive. So what
> > would be best for awk AFAIK is to have an option to not do the setvbuf() at
> > all if (e.g.) TOYFLAG_DEFAULTBUF us set, (I don't expect this to happen.)
>
> That's what I was initially doing, and Elliott didn't like it.

no you weren't --- "just let libc do its thing, and only override it
where you have an explicit motivation" was my suggested fix too.

specifically, if you look at
https://github.com/landley/toybox/commit/3e0e8c687eee4c292db106efab20f8e565549bf8
you see the old behavior was

    setvbuf(stdout, 0, (which->flags & TOYFLAG_LINEBUF) ? _IOLBF : _IONBF, 0);

aka "line buffering or _no_ buffering [based on toy flag]".

and the new behavior is

    setvbuf(stdout, buf, buf ? _IOFBF : _IOLBF, buf ? 4096 : 0);

aka "line buffering or full buffering [based on toy flag]".

whereas the default libc behavior (which is what i think raygard was
suggesting) is "line buffering or full buffering _based on isatty()_".

don't get me wrong --- i agree that there's a need for toys to be able
to override, and we've even seen microcom wanting _no_ buffering, but
"full or line, depending on isatty()" is a useful option too, which is
why it's the libc default. (via misguided horrors like "reading from
stdin implies a flush in most libcs because so much code assumes the
login prompt gets flushed when you try to read", though the musl guy
seems determined to fight that one and just call such code broken :-)
)

> Having a DEFAULT flag that is not the default is just... ow?

_if_ you went this way, it could be the default though, no? "in the
absence of an explicit TOYFLAG_(FULL|NO|LINE)BUF, we let libc decide"
is definitely an option. (and it's the one i had in mind when we first
started having these discussions.)

i don't know whether that would lead to an increase or decrease in
manual overrides though...

toybox$ grep -r TOYFLAG_.*BUF toys
toys/other/ascii.c:USE_ASCII(NEWTOY(ascii, 0,
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/blkid.c:USE_BLKID(NEWTOY(blkid, "ULo:s*[!LU]",
TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/blkid.c:USE_FSTYPE(NEWTOY(fstype, "<1", TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/vmstat.c:USE_VMSTAT(NEWTOY(vmstat, ">2n",
TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/ts.c:USE_TS(NEWTOY(ts, "ims",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/base64.c:USE_BASE64(NEWTOY(base64, "diw#<0=76[!dw]",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/base64.c:USE_BASE32(NEWTOY(base32, "diw#<0=76[!dw]",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/other/inotifyd.c:USE_INOTIFYD(NEWTOY(inotifyd, "<2",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/posix/tail.c:USE_TAIL(NEWTOY(tail,
"?fFs:c(bytes)-n(lines)-[-cn][-fF]",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/posix/head.c:USE_HEAD(NEWTOY(head,
"?n(lines)#<0=10c(bytes)#<0qv[-nc]",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/posix/grep.c:USE_GREP(NEWTOY(grep,
"(line-buffered)(color):;(exclude-dir)*S(exclude)*M(include)*ZzEFHIab(byte-offset)h(no-filename)ino(only-matching)rRsvwc(count)L(files-without-match)l(files-with-matches)q(quiet)(silent)e*f*C#B#A#m#x[!wx][!EF]",
TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF|TOYFLAG_AUTOCONF))
toys/posix/grep.c:USE_EGREP(OLDTOY(egrep, grep,
TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF|TOYFLAG_AUTOCONF))
toys/posix/grep.c:USE_FGREP(OLDTOY(fgrep, grep,
TOYFLAG_BIN|TOYFLAG_ARGFAIL(2)|TOYFLAG_LINEBUF|TOYFLAG_AUTOCONF))
toys/posix/echo.c:USE_ECHO(NEWTOY(echo, "^?Een[-eE]",
TOYFLAG_BIN|TOYFLAG_MAYFORK|TOYFLAG_LINEBUF))
toys/net/microcom.c:USE_MICROCOM(NEWTOY(microcom, "<1>1s#X",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_NOBUF))
toys/net/httpd.c:USE_HTTPD(NEWTOY(httpd, ">1v",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/net/ping.c:USE_PING(NEWTOY(ping,
"<1>1m#t#<0>255=64c#<0=3s#<0>4064=56i%W#<0=3w#<0qf46I:[-46]",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/net/ping.c:USE_PING(OLDTOY(ping6, ping,
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/lsb/dmesg.c:USE_DMESG(NEWTOY(dmesg,
"w(follow)W(follow-new)CSTtrs#<1n#c[!Ttr][!Cc][!SWw]",
TOYFLAG_BIN|TOYFLAG_LINEBUF))
toys/pending/awk.c:USE_AWK(NEWTOY(awk, "F:v*f*bc",
TOYFLAG_USR|TOYFLAG_BIN|TOYFLAG_LINEBUF))

many of those do look like "this is actually mostly used
interactively, so we hard-code line buffering" that would probably go
away if we just went with the libc default as the toybox default?

(at the risk of presenting a strawman argument, iirc you didn't like
that because you think there should be a timeout in the buffering? but
i don't think any of the alternatives short of "write your own stdio
replacement" get us anywhere closer to that. and -- while there's some
merit to the idea in principle as a user -- i can't think how to
implement it without signals or threads, and as a libc maintainer both
of those make me think i'm heading into "now i have _two_ problems"
territory :-) )

tbf to both of us, raygard's awk case might be the strongest argument
we've seen yet for the libc default being the toybox default?

> > Anyway, here is a fix and a bunch of tests. BTW only toybox awk and a recent
> > gawk (5.3.0 in my case) pass all the tests.
>
> When gnu adds a new creepy thing nobody else supports yet, I generally wait for
> something to break due to its absence rather than jumping through hoops to chase
> their taillights. Right now, busybox awk is used to build rather a lot of
> packages, and if they haven't made the decision to implement this weirdness than
> packages are likely to continue to support NOT doing that...
>
> > Ray
>
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net


More information about the Toybox mailing list