[Toybox] pathological case in sed s///g

Wed May 8 22:08:31 PDT 2019

I emailed Michael Kerrisk to update the man page to mention it...

Rob

On 5/8/19 7:52 PM, enh wrote:
> (i checked and macOS does have both REG_STARTEND and REG_PEND.)
> 
> From: Rob Landley <rob at landley.net>
> Date: Mon, May 6, 2019 at 10:42 AM
> To: enh
> Cc: toybox, Rich Felker
> 
>> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it.
>> It's not in the regex man page but it _is_ in the glibc headers...
>>
>>   https://github.com/bminor/glibc/commit/6fefb4e0b16
>>
>> Looks like it went into glibc in 2004, which is way past 7 years. I should poke
>> Michael Kerrisk to update the man page.
>>
>> And musl explicitly refused to do it, but Rich makes bad calls all the time:
>>
>>   https://www.openwall.com/lists/musl/2013/01/15/26
>>
>> I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
>> libc is the only perfect piece of software ever written, but I can probe for it
>> at compile time and #define 0 to stub out. (Unlike needing a manual config
>> symbol to work around his nommu insanity where he went out of his WAY to break
>> compile time probes, or needing to wrap syscalls myself in
>> https://github.com/landley/toybox/commit/833fb23fe8b4  because he decided he
>> didn't like a set of syscalls and _removed_ support musl had working fine at one
>> point...)
>>
>> If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc,
>> which would be very nice. And since it started life as a BSD extension (they're
>> the only man page _documenting_ it) I get support on FreeBSD and probably MacOS too.
>>
>> Cool, thanks,
>>
>> Rob
>>
>> On 5/6/19 10:56 AM, enh wrote:
>>> i think you might be able to use REG_STARTEND to avoid the implicit
>>> strlen: https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3&manpath=freebsd-release-ports
>>>
>>> REG_STARTEND
>>>    The string is considered to start at string +
>>>    pmatch[0].rm_so and to end before the byte located at
>>>    string + pmatch[0].rm_eo, regardless of the value of
>>>    nmatch.  See below for the definition of pmatch and nmatch.
>>>    This is an extension, compatible with but not specified by
>>>    IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
>>>    tion in software intended to be portable to other systems.
>>>
>>>    Without REG_NOTBOL, the position rm_so is considered the
>>>    beginning of a line, such that `^' matches before it, and
>>>    the beginning of a word if there is a word character at
>>>    this position, such that `[[:<:]]' and `\<' match before
>>>    it.
>>>
>>>    With REG_NOTBOL, the character at position rm_so is treated
>>>    as the continuation of a line, and if rm_so is greater than
>>>    0, the preceding character is taken into consideration.  If
>>>    the preceding character is a newline and the regular
>>>    expression was compiled with REG_NEWLINE, `^' matches
>>>    before the string; if the preceding character is not a word
>>>    character but the string starts with a word character,
>>>    `[[:<:]]' and `\<' match before the string.
>>>
>>> From: Rob Landley <rob at landley.net>
>>> Date: Sun, May 5, 2019 at 9:41 AM
>>> To: enh
>>> Cc: toybox
>>>
>>>> On 5/3/19 2:42 PM, enh wrote:
>>>>> On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
>>>>>>
>>>>>> On 5/3/19 1:56 PM, Rob Landley wrote:
>>>>>>> On 5/3/19 1:05 PM, enh wrote:
>>>>>>> But yeah, the new pessimal case after the change I'm making now would be a
>>>>>>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
>>>>>>> thing...
>>>>>>
>>>>>> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
>>>>>> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
>>>>>> know how much of the skipped data is "confirmed" and only copy it once we know
>>>>>> we're keeping it.
>>>>>
>>>>> yeah, that's what i meant by only ensuring that the first n characters
>>>>> are right.
>>>>>
>>>>>> So every kept byte is only copied once, and it can still be done in place for
>>>>>> the shrinking case. (You'll have to realloc when you expand, but that's probably
>>>>>> unavoidable...)
>>>>
>>>> Which I just did (not even trying in-place, everything gets copied once to a new
>>>> string), and it's still way too slow.
>>>>
>>>> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
>>>> time... Sigh. I forgot quite how much I hate perf, but:
>>>>
>>>>   68.60%  sed      sed                [.] regexec0
>>>>   29.93%  sed      [kernel.kallsyms]  [k] nmi
>>>>    1.47%  sed      libc-2.24.so       [.] strlen
>>>>    0.00%  sed      [kernel.kallsyms]  [k] __acct_update_integrals
>>>>
>>>> It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
>>>> the string before doing the regex each time (to figure out where the null to
>>>> skip to next is), that should go _after_ it tries to match this segment...
>>>>
>>>> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
>>>> dominating, but regexec0() is basically calling regexec() as the first thing and
>>>> returning immediately when it hits? The only thing that could be calling
>>>> strlen() is libc's regexec()? Hmmm...
>>>>
>>>> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>>>>          mlen, off, newlen;
>>>>
>>>>        // Loop finding match in remaining line (up to remaining len)
>>>> -      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>>>> +//      while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>>>> +while (!strncmp(rline, "x", 1)) {
>>>> +  match[0].rm_eo = 1;
>>>> +  match[1].rm_so = 0;
>>>> +
>>>>
>>>> And _that_ completes a megabyte input line basically instantly. It's regexec()
>>>> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
>>>> l1 cache that's taking up all the time now.
>>>>
>>>> Grrr. So I either have to write my own regex plumbing (which I can do, I've done
>>>> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
>>>> or I need to make my wrapper special case literal matches, which means they
>>>> don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
>>>> regex"...)
>>>>
>>>>   ^|.[*+?(\{$
>>>>
>>>> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
>>>> length and properly match NUL bytes. On the one hand that's really a post-1.0
>>>> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
>>>> it could probably speed up grep (I could go back to searching for multiple
>>>> regexes in parallel... What _is_ involved in implementing lex, anyway...)
>>>>
>>>> Rob
>>>
>