[Toybox] pathological case in sed s///g

Sun May 5 09:41:26 PDT 2019

On 5/3/19 2:42 PM, enh wrote:
> On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
>>
>> On 5/3/19 1:56 PM, Rob Landley wrote:
>>> On 5/3/19 1:05 PM, enh wrote:
>>> But yeah, the new pessimal case after the change I'm making now would be a
>>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
>>> thing...
>>
>> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
>> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
>> know how much of the skipped data is "confirmed" and only copy it once we know
>> we're keeping it.
> 
> yeah, that's what i meant by only ensuring that the first n characters
> are right.
> 
>> So every kept byte is only copied once, and it can still be done in place for
>> the shrinking case. (You'll have to realloc when you expand, but that's probably
>> unavoidable...)

Which I just did (not even trying in-place, everything gets copied once to a new
string), and it's still way too slow.

Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
time... Sigh. I forgot quite how much I hate perf, but:

  68.60%  sed      sed                [.] regexec0
  29.93%  sed      [kernel.kallsyms]  [k] nmi
   1.47%  sed      libc-2.24.so       [.] strlen
   0.00%  sed      [kernel.kallsyms]  [k] __acct_update_integrals

It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
the string before doing the regex each time (to figure out where the null to
skip to next is), that should go _after_ it tries to match this segment...

Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
dominating, but regexec0() is basically calling regexec() as the first thing and
returning immediately when it hits? The only thing that could be calling
strlen() is libc's regexec()? Hmmm...

@@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
         mlen, off, newlen;

       // Loop finding match in remaining line (up to remaining len)
-      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
+//      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
+while (!strncmp(rline, "x", 1)) {
+  match[0].rm_eo = 1;
+  match[1].rm_so = 0;
+

And _that_ completes a megabyte input line basically instantly. It's regexec()
calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
l1 cache that's taking up all the time now.

Grrr. So I either have to write my own regex plumbing (which I can do, I've done
it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
or I need to make my wrapper special case literal matches, which means they
don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
regex"...)

  ^|.[*+?(\{$

Sigh. If I _was_ going to write my own regex plumbing I could make it take a
length and properly match NUL bytes. On the one hand that's really a post-1.0
todo item. On the other hand we're talking maybe 200 lines of code here, _and_
it could probably speed up grep (I could go back to searching for multiple
regexes in parallel... What _is_ involved in implementing lex, anyway...)

Rob