[Toybox] pathological case in sed s///g
Rob Landley
rob at landley.net
Wed May 8 22:08:31 PDT 2019
I emailed Michael Kerrisk to update the man page to mention it...
Rob
On 5/8/19 7:52 PM, enh wrote:
> (i checked and macOS does have both REG_STARTEND and REG_PEND.)
>
> From: Rob Landley <rob at landley.net>
> Date: Mon, May 6, 2019 at 10:42 AM
> To: enh
> Cc: toybox, Rich Felker
>
>> Huh... I'll assume REG_STARTEND works in bionic since you're pointing me at it.
>> It's not in the regex man page but it _is_ in the glibc headers...
>>
>> https://github.com/bminor/glibc/commit/6fefb4e0b16
>>
>> Looks like it went into glibc in 2004, which is way past 7 years. I should poke
>> Michael Kerrisk to update the man page.
>>
>> And musl explicitly refused to do it, but Rich makes bad calls all the time:
>>
>> https://www.openwall.com/lists/musl/2013/01/15/26
>>
>> I still haven't got an #ifdef __MUSL__ in portability because Rich insists his
>> libc is the only perfect piece of software ever written, but I can probe for it
>> at compile time and #define 0 to stub out. (Unlike needing a manual config
>> symbol to work around his nommu insanity where he went out of his WAY to break
>> compile time probes, or needing to wrap syscalls myself in
>> https://github.com/landley/toybox/commit/833fb23fe8b4 because he decided he
>> didn't like a set of syscalls and _removed_ support musl had working fine at one
>> point...)
>>
>> If this _does_ match NUL bytes it lets me remove regexec0() entirely from libc,
>> which would be very nice. And since it started life as a BSD extension (they're
>> the only man page _documenting_ it) I get support on FreeBSD and probably MacOS too.
>>
>> Cool, thanks,
>>
>> Rob
>>
>> On 5/6/19 10:56 AM, enh wrote:
>>> i think you might be able to use REG_STARTEND to avoid the implicit
>>> strlen: https://www.freebsd.org/cgi/man.cgi?query=regex&sektion=3&manpath=freebsd-release-ports
>>>
>>> REG_STARTEND
>>> The string is considered to start at string +
>>> pmatch[0].rm_so and to end before the byte located at
>>> string + pmatch[0].rm_eo, regardless of the value of
>>> nmatch. See below for the definition of pmatch and nmatch.
>>> This is an extension, compatible with but not specified by
>>> IEEE Std 1003.2 (``POSIX.2''), and should be used with cau-
>>> tion in software intended to be portable to other systems.
>>>
>>> Without REG_NOTBOL, the position rm_so is considered the
>>> beginning of a line, such that `^' matches before it, and
>>> the beginning of a word if there is a word character at
>>> this position, such that `[[:<:]]' and `\<' match before
>>> it.
>>>
>>> With REG_NOTBOL, the character at position rm_so is treated
>>> as the continuation of a line, and if rm_so is greater than
>>> 0, the preceding character is taken into consideration. If
>>> the preceding character is a newline and the regular
>>> expression was compiled with REG_NEWLINE, `^' matches
>>> before the string; if the preceding character is not a word
>>> character but the string starts with a word character,
>>> `[[:<:]]' and `\<' match before the string.
>>>
>>> From: Rob Landley <rob at landley.net>
>>> Date: Sun, May 5, 2019 at 9:41 AM
>>> To: enh
>>> Cc: toybox
>>>
>>>> On 5/3/19 2:42 PM, enh wrote:
>>>>> On Fri, May 3, 2019 at 11:59 AM Rob Landley <rob at landley.net> wrote:
>>>>>>
>>>>>> On 5/3/19 1:56 PM, Rob Landley wrote:
>>>>>>> On 5/3/19 1:05 PM, enh wrote:
>>>>>>> But yeah, the new pessimal case after the change I'm making now would be a
>>>>>>> megabyte of xyxyxyxy with 's/xy/x/g' _THAT_ would need the one output buffer
>>>>>>> thing...
>>>>>>
>>>>>> And it can be avoided by an in-place copy that remembers "last gap" and doesn't
>>>>>> copy trailing data until the _next_ hit (or we confirm there's no hit) so we
>>>>>> know how much of the skipped data is "confirmed" and only copy it once we know
>>>>>> we're keeping it.
>>>>>
>>>>> yeah, that's what i meant by only ensuring that the first n characters
>>>>> are right.
>>>>>
>>>>>> So every kept byte is only copied once, and it can still be done in place for
>>>>>> the shrinking case. (You'll have to realloc when you expand, but that's probably
>>>>>> unavoidable...)
>>>>
>>>> Which I just did (not even trying in-place, everything gets copied once to a new
>>>> string), and it's still way too slow.
>>>>
>>>> Time to dig out https://jvns.ca/perf-zine.pdf and see where it's spending its
>>>> time... Sigh. I forgot quite how much I hate perf, but:
>>>>
>>>> 68.60% sed sed [.] regexec0
>>>> 29.93% sed [kernel.kallsyms] [k] nmi
>>>> 1.47% sed libc-2.24.so [.] strlen
>>>> 0.00% sed [kernel.kallsyms] [k] __acct_update_integrals
>>>>
>>>> It's regexec0(). Hmmmm... ah, my NUL wrapper is essentially calling strlen() on
>>>> the string before doing the regex each time (to figure out where the null to
>>>> skip to next is), that should go _after_ it tries to match this segment...
>>>>
>>>> Ok, and now that I've fixed that it's _still_ taking forever and strlen() is
>>>> dominating, but regexec0() is basically calling regexec() as the first thing and
>>>> returning immediately when it hits? The only thing that could be calling
>>>> strlen() is libc's regexec()? Hmmm...
>>>>
>>>> @@ -473,11 +473,15 @@ static void sed_line(char **pline, long plen)
>>>> mlen, off, newlen;
>>>>
>>>> // Loop finding match in remaining line (up to remaining len)
>>>> - while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>>>> +// while (!regexec0(reg, rline, len-(rline-line), 10, match, mflags)) {
>>>> +while (!strncmp(rline, "x", 1)) {
>>>> + match[0].rm_eo = 1;
>>>> + match[1].rm_so = 0;
>>>> +
>>>>
>>>> And _that_ completes a megabyte input line basically instantly. It's regexec()
>>>> calling strlen() each time pulling (n^2)/2 bytes (half a terabyte?) through the
>>>> l1 cache that's taking up all the time now.
>>>>
>>>> Grrr. So I either have to write my own regex plumbing (which I can do, I've done
>>>> it before years ago, but I really didn't _want_ to here because LIBC HAS ONE),
>>>> or I need to make my wrapper special case literal matches, which means they
>>>> don't have... let's see... (For the honor of <strike>greyskull</strike> "man 7
>>>> regex"...)
>>>>
>>>> ^|.[*+?(\{$
>>>>
>>>> Sigh. If I _was_ going to write my own regex plumbing I could make it take a
>>>> length and properly match NUL bytes. On the one hand that's really a post-1.0
>>>> todo item. On the other hand we're talking maybe 200 lines of code here, _and_
>>>> it could probably speed up grep (I could go back to searching for multiple
>>>> regexes in parallel... What _is_ involved in implementing lex, anyway...)
>>>>
>>>> Rob
>>>
>
More information about the Toybox
mailing list