[Toybox] pathological case in sed s///g

Mon May 6 10:42:44 PDT 2019

Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it.
It's not in the regex man page but it _is_ in the glibc headers...

  https://github.com/bminor/glibc/commit/6fefb4e0b16

Looks like it went into glibc in 2004, which is way past 7 years. I should poke
Michael Kerrisk to update the man page.

And musl explicitly refused to do it, but Rich makes bad calls all the time:

  https://www.openwall.com/lists/musl/2013/01/15/26

I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
libc is the only perfect piece of software ever written, but I can probe for it
at compile time and #define 0 to stub out. (Unlike needing a manual config
symbol to work around his nommu insanity where he went out of his WAY to break
compile time probes, or needing to wrap syscalls myself in
https://github.com/landley/toybox/commit/833fb23fe8b4  because he decided he
didn't like a set of syscalls and _removed_ support musl had working fine at one
point...)

If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc,
which would be very nice. And since it started life as a BSD extension (they're
the only man page _documenting_ it) I get support on FreeBSD and probably MacOS too.

Cool, thanks,

Rob

On 5/6/19 10:56 AM, enh wrote:
> i think you might be able to use REG_STARTEND to avoid the implicit
> strlen: https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3&manpath=freebsd-release-ports
> 
> REG_STARTEND
>    The string is considered to start at string +
>    pmatch[0].rm_so and to end before the byte located at
>    string + pmatch[0].rm_eo, regardless of the value of
>    nmatch.  See below for the definition of pmatch and nmatch.
>    This is an extension, compatible with but not specified by
>    IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
>    tion in software intended to be portable to other systems.
>
>    Without REG_NOTBOL, the position rm_so is considered the
>    beginning of a line, such that `^' matches before it, and
>    the beginning of a word if there is a word character at
>    this position, such that `[[:<:]]' and `\<' match before
>    it.
> 
>    With REG_NOTBOL, the character at position rm_so is treated
>    as the continuation of a line, and if rm_so is greater than
>    0, the preceding character is taken into consideration.  If
>    the preceding character is a newline and the regular
>    expression was compiled with REG_NEWLINE, `^' matches
>    before the string; if the preceding character is not a word
>    character but the string starts with a word character,
>    `[[:<:]]' and `\<' match before the string.
> 
> From: Rob Landley <rob at landley.net>
> Date: Sun, May 5, 2019 at 9:41 AM
> To: enh
> Cc: toybox
> 
>> On 5/3/19 2:42 PM, enh wrote:
>>> On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
>>>>
>>>> On 5/3/19 1:56 PM, Rob Landley wrote:
>>>>> On 5/3/19 1:05 PM, enh wrote:
>>>>> But yeah, the new pessimal case after the change I'm making now would be a
>>>>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
>>>>> thing...
>>>>
>>>> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
>>>> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
>>>> know how much of the skipped data is "confirmed" and only copy it once we know
>>>> we're keeping it.
>>>
>>> yeah, that's what i meant by only ensuring that the first n characters
>>> are right.
>>>
>>>> So every kept byte is only copied once, and it can still be done in place for
>>>> the shrinking case. (You'll have to realloc when you expand, but that's probably
>>>> unavoidable...)
>>
>> Which I just did (not even trying in-place, everything gets copied once to a new
>> string), and it's still way too slow.
>>
>> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
>> time... Sigh. I forgot quite how much I hate perf, but:
>>
>>   68.60%  sed      sed                [.] regexec0
>>   29.93%  sed      [kernel.kallsyms]  [k] nmi
>>    1.47%  sed      libc-2.24.so       [.] strlen
>>    0.00%  sed      [kernel.kallsyms]  [k] __acct_update_integrals
>>
>> It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
>> the string before doing the regex each time (to figure out where the null to
>> skip to next is), that should go _after_ it tries to match this segment...
>>
>> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
>> dominating, but regexec0() is basically calling regexec() as the first thing and
>> returning immediately when it hits? The only thing that could be calling
>> strlen() is libc's regexec()? Hmmm...
>>
>> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>>          mlen, off, newlen;
>>
>>        // Loop finding match in remaining line (up to remaining len)
>> -      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>> +//      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>> +while (!strncmp(rline, "x", 1)) {
>> +  match[0].rm_eo = 1;
>> +  match[1].rm_so = 0;
>> +
>>
>> And _that_ completes a megabyte input line basically instantly. It's regexec()
>> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
>> l1 cache that's taking up all the time now.
>>
>> Grrr. So I either have to write my own regex plumbing (which I can do, I've done
>> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
>> or I need to make my wrapper special case literal matches, which means they
>> don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
>> regex"...)
>>
>>   ^|.[*+?(\{$
>>
>> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
>> length and properly match NUL bytes. On the one hand that's really a post-1.0
>> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
>> it could probably speed up grep (I could go back to searching for multiple
>> regexes in parallel... What _is_ involved in implementing lex, anyway...)
>>
>> Rob
>