[Toybox] pathological case in sed s///g

Mon May 6 08:56:53 PDT 2019

i think you might be able to use REG_STARTEND to avoid the implicit
strlen: https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3&manpath=freebsd-release-ports

REG_STARTEND
   The string is considered to start at string +
   pmatch[0].rm_so and to end before the byte located at
   string + pmatch[0].rm_eo, regardless of the value of
   nmatch.  See below for the definition of pmatch and nmatch.
   This is an extension, compatible with but not specified by
   IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
   tion in software intended to be portable to other systems.

   Without REG_NOTBOL, the position rm_so is considered the
   beginning of a line, such that `^' matches before it, and
   the beginning of a word if there is a word character at
   this position, such that `[[:<:]]' and `\<' match before
   it.

   With REG_NOTBOL, the character at position rm_so is treated
   as the continuation of a line, and if rm_so is greater than
   0, the preceding character is taken into consideration.  If
   the preceding character is a newline and the regular
   expression was compiled with REG_NEWLINE, `^' matches
   before the string; if the preceding character is not a word
   character but the string starts with a word character,
   `[[:<:]]' and `\<' match before the string.

From: Rob Landley <rob at landley.net>
Date: Sun, May 5, 2019 at 9:41 AM
To: enh
Cc: toybox

> On 5/3/19 2:42 PM, enh wrote:
> > On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
> >>
> >> On 5/3/19 1:56 PM, Rob Landley wrote:
> >>> On 5/3/19 1:05 PM, enh wrote:
> >>> But yeah, the new pessimal case after the change I'm making now would be a
> >>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
> >>> thing...
> >>
> >> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
> >> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
> >> know how much of the skipped data is "confirmed" and only copy it once we know
> >> we're keeping it.
> >
> > yeah, that's what i meant by only ensuring that the first n characters
> > are right.
> >
> >> So every kept byte is only copied once, and it can still be done in place for
> >> the shrinking case. (You'll have to realloc when you expand, but that's probably
> >> unavoidable...)
>
> Which I just did (not even trying in-place, everything gets copied once to a new
> string), and it's still way too slow.
>
> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
> time... Sigh. I forgot quite how much I hate perf, but:
>
>   68.60%  sed      sed                [.] regexec0
>   29.93%  sed      [kernel.kallsyms]  [k] nmi
>    1.47%  sed      libc-2.24.so       [.] strlen
>    0.00%  sed      [kernel.kallsyms]  [k] __acct_update_integrals
>
> It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
> the string before doing the regex each time (to figure out where the null to
> skip to next is), that should go _after_ it tries to match this segment...
>
> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
> dominating, but regexec0() is basically calling regexec() as the first thing and
> returning immediately when it hits? The only thing that could be calling
> strlen() is libc's regexec()? Hmmm...
>
> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>          mlen, off, newlen;
>
>        // Loop finding match in remaining line (up to remaining len)
> -      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> +//      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
> +while (!strncmp(rline, "x", 1)) {
> +  match[0].rm_eo = 1;
> +  match[1].rm_so = 0;
> +
>
> And _that_ completes a megabyte input line basically instantly. It's regexec()
> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
> l1 cache that's taking up all the time now.
>
> Grrr. So I either have to write my own regex plumbing (which I can do, I've done
> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
> or I need to make my wrapper special case literal matches, which means they
> don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
> regex"...)
>
>   ^|.[*+?(\{$
>
> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
> length and properly match NUL bytes. On the one hand that's really a post-1.0
> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
> it could probably speed up grep (I could go back to searching for multiple
> regexes in parallel... What _is_ involved in implementing lex, anyway...)
>
> Rob