[Toybox] pathological case in sed s///g

Wed May 8 17:52:22 PDT 2019

(i checked and macOS does have both REG_STARTEND and REG_PEND.)

From: Rob Landley <rob at landley.net>
Date: Mon, May 6, 2019 at 10:42 AM
To: enh
Cc: toybox, Rich Felker

> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it.
> It's not in the regex man page but it _is_ in the glibc headers...
>
>   https://github.com/bminor/glibc/commit/6fefb4e0b16
>
> Looks like it went into glibc in 2004, which is way past 7 years. I should poke
> Michael Kerrisk to update the man page.
>
> And musl explicitly refused to do it, but Rich makes bad calls all the time:
>
>   https://www.openwall.com/lists/musl/2013/01/15/26
>
> I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
> libc is the only perfect piece of software ever written, but I can probe for it
> at compile time and #define 0 to stub out. (Unlike needing a manual config
> symbol to work around his nommu insanity where he went out of his WAY to break
> compile time probes, or needing to wrap syscalls myself in
> https://github.com/landley/toybox/commit/833fb23fe8b4  because he decided he
> didn't like a set of syscalls and _removed_ support musl had working fine at one
> point...)
>
> If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc,
> which would be very nice. And since it started life as a BSD extension (they're
> the only man page _documenting_ it) I get support on FreeBSD and probably MacOS too.
>
> Cool, thanks,
>
> Rob
>
> On 5/6/19 10:56 AM, enh wrote:
> > i think you might be able to use REG_STARTEND to avoid the implicit
> > strlen: https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3&manpath=freebsd-release-ports
> >
> > REG_STARTEND
> >    The string is considered to start at string +
> >    pmatch[0].rm_so and to end before the byte located at
> >    string + pmatch[0].rm_eo, regardless of the value of
> >    nmatch.  See below for the definition of pmatch and nmatch.
> >    This is an extension, compatible with but not specified by
> >    IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
> >    tion in software intended to be portable to other systems.
> >
> >    Without REG_NOTBOL, the position rm_so is considered the
> >    beginning of a line, such that `^' matches before it, and
> >    the beginning of a word if there is a word character at
> >    this position, such that `[[:<:]]' and `\<' match before
> >    it.
> >
> >    With REG_NOTBOL, the character at position rm_so is treated
> >    as the continuation of a line, and if rm_so is greater than
> >    0, the preceding character is taken into consideration.  If
> >    the preceding character is a newline and the regular
> >    expression was compiled with REG_NEWLINE, `^' matches
> >    before the string; if the preceding character is not a word
> >    character but the string starts with a word character,
> >    `[[:<:]]' and `\<' match before the string.
> >
> > From: Rob Landley <rob at landley.net>
> > Date: Sun, May 5, 2019 at 9:41 AM
> > To: enh
> > Cc: toybox
> >
> >> On 5/3/19 2:42 PM, enh wrote:
> >>> On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
> >>>>
> >>>> On 5/3/19 1:56 PM, Rob Landley wrote:
> >>>>> On 5/3/19 1:05 PM, enh wrote:
> >>>>> But yeah, the new pessimal case after the change I'm making now would be a
> >>>>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
> >>>>> thing...
> >>>>
> >>>> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
> >>>> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
> >>>> know how much of the skipped data is "confirmed" and only copy it once we know
> >>>> we're keeping it.
> >>>
> >>> yeah, that's what i meant by only ensuring that the first n characters
> >>> are right.
> >>>
> >>>> So every kept byte is only copied once, and it can still be done in place for
> >>>> the shrinking case. (You'll have to realloc when you expand, but that's probably
> >>>> unavoidable...)
> >>
> >> Which I just did (not even trying in-place, everything gets copied once to a new
> >> string), and it's still way too slow.
> >>
> >> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
> >> time... Sigh. I forgot quite how much I hate perf, but:
> >>
> >>   68.60%  sed      sed                [.] regexec0
> >>   29.93%  sed      [kernel.kallsyms]  [k] nmi
> >>    1.47%  sed      libc-2.24.so       [.] strlen
> >>    0.00%  sed      [kernel.kallsyms]  [k] __acct_update_integrals
> >>
> >> It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
> >> the string before doing the regex each time (to figure out where the null to
> >> skip to next is), that should go _after_ it tries to match this segment...
> >>
> >> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
> >> dominating, but regexec0() is basically calling regexec() as the first thing and
> >> returning immediately when it hits? The only thing that could be calling
> >> strlen() is libc's regexec()? Hmmm...
> >>
> >> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
> >>          mlen, off, newlen;
> >>
> >>        // Loop finding match in remaining line (up to remaining len)
> >> -      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> >> +//      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> >> +while (!strncmp(rline, "x", 1)) {
> >> +  match[0].rm_eo = 1;
> >> +  match[1].rm_so = 0;
> >> +
> >>
> >> And _that_ completes a megabyte input line basically instantly. It's regexec()
> >> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
> >> l1 cache that's taking up all the time now.
> >>
> >> Grrr. So I either have to write my own regex plumbing (which I can do, I've done
> >> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
> >> or I need to make my wrapper special case literal matches, which means they
> >> don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
> >> regex"...)
> >>
> >>   ^|.[*+?(\{$
> >>
> >> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
> >> length and properly match NUL bytes. On the one hand that's really a post-1.0
> >> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
> >> it could probably speed up grep (I could go back to searching for multiple
> >> regexes in parallel... What _is_ involved in implementing lex, anyway...)
> >>
> >> Rob
> >