[Toybox] grep and empty regexes

Rob Landley rob at landley.net
Mon Jul 29 04:21:39 PDT 2019


On 7/28/19 12:32 PM, enh via Toybox wrote:
> any thoughts on this? the choices seem to be:

My head is full of shell plumbing. (This should succeed: { ln -s /dev/null blah;
set -o noclobber; echo > blah ;} but this should fail: { touch walrus; ln -s
walrus penguin; set -o noclobber; echo > penguin;})
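
For anyone who wants to try those two cases, here is a minimal sketch. It assumes a POSIX-ish sh where noclobber only refuses to overwrite existing regular files; the temp directory and the status variable names are only for illustration.

```shell
#!/bin/sh
# Sketch: compare noclobber behavior for a symlink to a device
# versus a symlink to a regular file. Names here are illustrative.
dir=$(mktemp -d) || exit 1

ln -s /dev/null "$dir/blah"      # symlink to a character device
touch "$dir/walrus"              # existing regular file
ln -s walrus "$dir/penguin"      # symlink to that regular file

# Run each case in a subshell so noclobber does not leak out.
devnull_status=0
( set -o noclobber; echo > "$dir/blah" ) 2>/dev/null || devnull_status=$?

regular_status=0
( set -o noclobber; echo > "$dir/penguin" ) 2>/dev/null || regular_status=$?

echo "redirect onto symlink to /dev/null: exit $devnull_status"
echo "redirect onto symlink to regular file: exit $regular_status"

rm -rf "$dir"
```

The first redirect should report success and the second a nonzero status.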

> * keep BSD behavior on BSD libc systems, GNU behavior on GNU libc

Your timing is impeccable. Guess what the gmail spurious delivery failure due to
spam false positive du jour was? (Other than me. I get unsubscribed from my own
list and have to send a "confirm" email twice a week. Yes, I need to move off
gmail, but I'm kinda busy...)

This is a Mailman mailing list bounce action notice:

    List:       Toybox
    Member:     emaste at freebsd.org
    Action:     Subscription disabled.
    Reason:     Excessive or fatal bounces.

12:27 PM today. So our resident BSD expert probably didn't see it. I'll cc: him...

> systems. the only toybox change is to tweak the tests to have the
> right expectations for the system they're on, probably by checking
> whether -E with a leading + is an error or not? (because there are
> inherently going to be some differences, and the leading + is one of
> them.) i can send a patch for that if you'd prefer to go that way.

Do you have a catalog of what the differences _are_? (Empty regex and leading +
so far?)
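
One rough way to probe the leading + case on a given system is sketched below. The sample input and the variable names are assumptions for illustration, and the only thing it relies on is that grep exits with a status greater than 1 when it reports an error.

```shell
#!/bin/sh
# Probe: does this system's grep reject an extended regex that starts
# with "+"? Exit status 0 or 1 means the pattern was accepted (match or
# no match); anything greater than 1 means grep reported an error.
status=0
printf '+x\n' | grep -E '+x' >/dev/null 2>&1 || status=$?

if [ "$status" -gt 1 ]; then
  leading_plus=error
else
  leading_plus=accepted
fi

echo "leading + in -E pattern: $leading_plus"
```

A test script could branch on that result to pick which expectations to use.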

My long-term goal is to try to build software packages under a bionic root
filesystem.

> * make the GNU behavior an error everywhere. (i.e. check for the empty
> regex and reject it.) doesn't address other issues (like leading +).

The empty regex is a thing I think I've actually seen in package builds, which
is why I had a test for it? I think?

But it was long enough ago, I don't remember _which_ package(s), and my "plug
modern toybox into the last aboriginal linux release and try to build all those
old LFS packages with it" effort got derailed by switching laptops a couple
months back. It's on the old machine, about 2/3 finished. (This is why I need to
get mkroot up to being a proper aboriginal linux replacement, and the blockers
for that are sh and route. Well, and make, but I can break down and build the
old one for the moment...)

(P.S. The very first bug I hit trying to use busybox in place of Linux From
Scratch chapter 5 was that autoconf hung because awk with an empty pattern was
supposed to act like cat, and busybox circa 2002 didn't. There was a lot of
"empty pattern makes the command a NOP" laziness in package build plumbing back
then...)

> * try to work around BSD behavior (this patch).

Working around the bsd behavior in grep sounds like the best option so far, _if_
the workarounds are small. The more extensive the divergence is, the less
palatable that is.

> it might come down to "where did these tests come from?" --- did you
> hit these in practice somewhere, or was this just you poking at corner
> cases and wondering what happens if you supply an empty regex? (for
> obvious reasons it's a bit tricky for me to search for uses of an
> empty regex :-) )

I vaguely recall it was a package build? Probably Horrible Autoconf plumbing?
But I'm not sure. I remember I was on a bus coming home from the airport, and
that bus route stops at an ACC campus along the way for 15 minutes, and _that's_
where I was when I made empty regexes in grep work. But I don't remember why. (I
was also arguing with Rich Felker about something in email, but that doesn't
narrow it down in the slightest.)

This is why I blog about things, but even with stuff like:

  http://landley.net/notes-2016.html#24-07-2016

there are still unanswered questions. :(

>> not sure what to do here, in particular because -- given your tests --
>> i don't think we can represent the GNU interpretation as a POSIX
>> regular expression?

The problem _I_ was fighting with is that if you have multiple regexes and want
to match them in a single pass (so you're not making one pass per regex over a
potentially long string and thrashing your cache), you basically need
(regex)|(regex)|(regex), which you can only do with _extended_ regular
expressions. EXCEPT that in the gnu plumbing you can use \| in a non-extended
regex and it works. But it didn't work in musl, and Rich refused to add it. (And
now it wouldn't have worked in bionic. And Rich pointed out that \1 and friends
would get the numbers wrong for regexes after the first.)

Eventually I tore it out and did the multiple passes thing, to be right rather
than fast.
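
From the command line, the two approaches look roughly like this. It is only a sketch with made-up sample data and patterns, not toybox's implementation, and the shell loop stands in for what grep would do internally.

```shell
#!/bin/sh
# Made-up sample data and patterns, for illustration only.
input='apple
banana
cherry'

# Single pass: parenthesize each pattern and join with |, which needs -E.
single=$(printf '%s\n' "$input" | grep -E '(an)|(rr)')

# Multiple passes: try each pattern against each line in turn.
multi=$(printf '%s\n' "$input" | while IFS= read -r line; do
  for pat in 'an' 'rr'; do
    if printf '%s\n' "$line" | grep -q -e "$pat"; then
      printf '%s\n' "$line"
      break
    fi
  done
done)

[ "$single" = "$multi" ] && echo "same matches"
```

The parentheses added for the single-pass form are what shift the group numbers, so a \1 written in a later pattern would point at an earlier pattern's group.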

Anyway, I applied your patch.

I'm glad you're giving the test suite some attention, but you're polishing
what's _there_ and my concerns are more about what _isn't_. Longer term I need
to A) write a gazillion more tests (based on a close re-reading of the spec, the
relevant man page or RFC, and the source code), and B) get mkroot to the point I
can run more sheep across the minefield and catch issues with real world data.

Both are giant time sinks, and at the moment I'm working on B. :)

Thanks,

Rob

P.S. And gmail unsubscribed 8 other people from the list, all in response to
http://lists.landley.net/pipermail/toybox-landley.net/2019-July/010726.html
refusing to be delivered to anything gmail. I'm kind of surprised there are
still 8 gmail subscribers _on_ the list, given that I've stopped re-subscribing
them through the web interface every time this happens. I apologize to the lot
of them, who probably won't see this email because it doesn't backfill when you
do send in the confirmation. You have to check the web archive to see what
you've missed...


