[Toybox] grep and empty regexes

enh enh at google.com
Mon Jul 29 11:59:15 PDT 2019


On Mon, Jul 29, 2019 at 4:19 AM Rob Landley <rob at landley.net> wrote:
>
> On 7/28/19 12:32 PM, enh via Toybox wrote:
> > any thoughts on this? the choices seem to be:
>
> My head is full of shell plumbing. (This should succeed: { ln -s /dev/null blah;
> set -o noclobber; echo > blah ;} but this should fail: { touch walrus; ln -s
> walrus penguin; set -o noclobber; echo > penguin;})
>
> > * keep BSD behavior on BSD libc systems, GNU behavior on GNU libc
>
> Your timing is impeccable. Guess what the gmail spurious delivery failure due to
> spam false positive du jour was? (Other than me. I get unsubscribed from my own
> list and have to send a "confirm" email twice a week. Yes, I need to move off
> gmail, but I'm kinda busy...)
>
> This is a Mailman mailing list bounce action notice:
>
>     List:       Toybox
>     Member:     emaste at freebsd.org
>     Action:     Subscription disabled.
>     Reason:     Excessive or fatal bounces.
>
> 12:27 PM today. So our resident BSD expert probably didn't see it. I'll cc: him...
>
> > systems. the only toybox change is to tweak the tests to have the
> > right expectations for the system they're on, probably by checking
> > whether -E with a leading + is an error or not? (because there are
> > inherently going to be some differences, and the leading + is one of
> > them.) i can send a patch for that if you'd prefer to go that way.
>
> Do you have a catalog of what the differences _are_? (Empty regex and leading +
> so far?)

no, i've got no idea. seems like there are bits and bobs in various
GNU docs (about the tools themselves and libc), but i didn't find
anything detailed enough to include either of these.
https://www.regular-expressions.info/gnu.html is horrifically garish,
but does seem to be a good condensation of the GNU info i found from
other sources.
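
A rough sketch of how the tests could find out which flavor of libc regex they are running against, by asking regcomp() directly. The helper names are made up for illustration and are not in toybox, and it reports whatever the local libc does rather than assuming which libc accepts what.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if this libc's regcomp() accepts the pattern, 0 if it rejects it. */
int regcomp_accepts(const char *pattern, int cflags)
{
  regex_t re;

  assert(pattern != NULL);
  /* REG_NOSUB: we only care whether compilation succeeds. */
  if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0) return 0;
  regfree(&re);
  return 1;
}

/* Hypothetical probe: is the empty regex accepted? */
int libc_accepts_empty_regex(void)
{
  return regcomp_accepts("", 0);
}

/* Hypothetical probe: is an extended regex with a leading + accepted? */
int libc_accepts_leading_plus(void)
{
  return regcomp_accepts("+a", REG_EXTENDED);
}
```

A test could branch on these two results to choose its expected output, instead of hardcoding one libc's behavior.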

> My long-term goal is to try to build software packages under a bionic root
> filesystem.
>
> > * make the GNU behavior an error everywhere. (i.e. check for the empty
> > regex and reject it.) doesn't address other issues (like leading +).
>
> The empty regex is a thing I think I've actually seen in package builds, which
> is why I had a test for it? I think?
>
> But it was long enough ago, I don't remember _which_ package(s), and my "plug
> modern toybox into the last aboriginal linux release and try to build all those
> old LFS packages with it" effort got derailed by switching laptops a couple
> months back. It's on the old machine, about 2/3 finished. (This is why I need to
> get mkroot up to being a proper aboriginal linux replacement, and the blockers
> for that are sh and route. Well, and make, but I can break down and build the
> old one for the moment...)
>
> (P.S. The very first bug I hit trying to use busybox in place of Linux From
> Scratch chapter 5 was that autoconf hung because awk with an empty pattern was
> supposed to act like cat, and busybox circa 2002 didn't. There was a lot of
> "empty pattern makes the command a NOP" laziness in package build plumbing back
> then...)
>
> > * try to work around BSD behavior (this patch).
>
> Working around the bsd behavior in grep sounds like the best option so far, _if_
> the workarounds are small. The more extensive the divergence is, the less
> palatable that is.

would you like me to move the workaround into xregcomp instead? or do
you want to wait until we see someone need this in sed or wherever
first?
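
For reference, a minimal sketch of what a shared workaround might look like if it lived in a wrapper rather than in grep itself. This is not the patch that was applied and not the real xregcomp. The struct and function names are hypothetical, and it assumes the GNU meaning of an empty regex is simply "matches every line" (it ignores how that interacts with options like -x or -w).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical wrapper: a compiled regex plus a "matches everything" flag. */
struct pattern {
  regex_t re;
  int match_all;   /* set when the source pattern was the empty string */
};

/* Compile pat. Returns 0 on success, or the regcomp() error code. */
int pattern_compile(struct pattern *p, const char *pat, int cflags)
{
  assert(p != NULL && pat != NULL);
  p->match_all = (*pat == '\0');
  if (p->match_all) return 0;   /* never hand "" to regcomp() */
  return regcomp(&p->re, pat, cflags);
}

/* Returns 1 if line matches, 0 otherwise. */
int pattern_matches(const struct pattern *p, const char *line)
{
  if (p->match_all) return 1;
  return regexec(&p->re, line, 0, NULL, 0) == 0;
}

/* Release the compiled regex, if one was actually compiled. */
void pattern_free(struct pattern *p)
{
  if (!p->match_all) regfree(&p->re);
}
```

The point of the flag is that the empty pattern never reaches libc, so the result is the same whether the local regcomp() accepts it or rejects it.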

> > it might come down to "where did these tests come from?" --- did you
> > hit these in practice somewhere, or was this just you poking at corner
> > cases and wondering what happens if you supply an empty regex? (for
> > obvious reasons it's a bit tricky for me to search for uses of an
> > empty regex :-) )
>
> I vaguely recall it was a package build? Probably Horrible Autoconf plumbing?
> But I'm not sure. I remember I was on a bus coming home from the airport, and
> that bus route stops at an ACC campus along the way for 15 minutes, and _that's_
> where I was when I made empty regexes in grep work. But I don't remember why. (I
> was also arguing with Rich Felker about something in email, but that doesn't
> narrow it down in the slightest.)
>
> This is why I blog about things, but even with stuff like:
>
>   http://landley.net/notes-2016.html#24-07-2016
>
> there are still unanswered questions. :(
>
> >> not sure what to do here, in particular because -- given your tests --
> >> i don't think we can represent the GNU interpretation as a POSIX
> >> regular expression?
>
> The problem _I_ was fighting with is if you have multiple regexes and want to do
> it in a single pass so you're not O(N) iterating over a potentially long string
> and thrashing your cache, you basically need (regex)|(regex)|(regex) which you
> can only do with _extended_ regular expressions. EXCEPT that in the gnu plumbing
> you can use \| in a non-extended regex and it works. But it didn't work in musl,
> and Rich refused to add it. (And now it wouldn't have worked in bionic. And Rich
> pointed out that \1 and friends would get the numbers wrong for regexes after
> the first.)
>
> Eventually I tore it out and did the multiple passes thing, to be right rather
> than fast.
>
> Anyway, I applied your patch.

thanks.
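
The "multiple passes" approach described in the quoted text above can be sketched roughly as follows: compile each pattern on its own and try them one after another against each line. The function name is hypothetical and this is not the toybox code. Since every regex is compiled separately, a backreference like \1 counts groups only within its own pattern, which is the numbering problem the combined (regex)|(regex) form runs into.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Try each already-compiled regex against line, in order.
 * Returns the index of the first one that matches, or -1 if none do.
 */
int first_matching_regex(const regex_t *res, size_t count, const char *line)
{
  size_t i;

  assert(line != NULL);
  for (i = 0; i < count; i++)
    if (regexec(&res[i], line, 0, NULL, 0) == 0) return (int)i;
  return -1;
}
```

The cost is one scan of the line per pattern instead of a single scan, which is the "right rather than fast" trade-off.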

> I'm glad you're giving the test suite some attention, but you're polishing
> what's _there_ and my concerns are more about what _isn't_. Longer term I need
> to A) write a gazillion more tests (based on a close re-reading of the spec, the
> relevant man page or RFC, and the source code), B) get mkroot to the point I can
> run more sheep across the minefield and catch issues with real world data.

oh, yeah, *coverage* is a huge blind spot for us at the moment, and
something i want to look at. but first i wanted to get as many of the
tests running in presubmit as possible, to maximize my
pain^W^W^Wminimize the number of bugs that make it through to folks
who're just trying to build AOSP/just trying to use a device. not
every OEM is as keen on the idea of a reproducible hermetic build as
you might expect, so giving them fewer stones to throw seems like a
good idea :-)

i'm down to just blkid and du failures on a local taimen device now...

it's time to admit to myself i'm not likely to implement the ntfs
LABEL support any time soon, and at least send you a patch that fixes
all the other issues. (sent separately.)

as for du versus the extra space used for extended attributes, i'm
still not sure what to do about that...

> Both are giant time sinks, and at the moment I'm working on B. :)
>
> Thanks,
>
> Rob
>
> P.S. And gmail unsubscribed 8 other people from the list, all in response to
> http://lists.landley.net/pipermail/toybox-landley.net/2019-July/010726.html
> refusing to be delivered to anything gmail. I'm kind of surprised there are
> still 8 gmail subscribers _on_ the list, given that I've stopped re-subscribing
> them through the web interface every time this happens. I apologize to the lot
> of them, who probably won't see this email because it doesn't backfill when you
> do send in the confirmation. You have to check the web archive to see what
> you've missed...


