[Toybox] Slow grep

enh enh at google.com
Fri Sep 16 10:30:21 PDT 2022


On Fri, Sep 16, 2022 at 7:47 AM Yi-yo Chiang <yochiang at google.com> wrote:

>
>
> On Fri, Sep 16, 2022 at 1:20 PM Rob Landley <rob at landley.net> wrote:
>
>> On 9/15/22 16:32, enh wrote:
>> > On Thu, Sep 15, 2022 at 1:45 PM Rob Landley <rob at landley.net
>> > <mailto:rob at landley.net>> wrote:
>> >
>> >     On 9/15/22 07:30, Yi-yo Chiang via Toybox wrote:
>> >     > grep is slow when the number of patterns is large.
>> >     ...
>> >     >   xxd -p -c 0 -l 40 /dev/urandom
>> >
>> >     Huh, WHY does that produce two lines of output?
>> >
>> >     $ xxd -p -c 0 -l 40 /dev/urandom
>> >     1fcf13e1b4844ba209fb9958bde26a13577c577744f1b1290240d03f4f8e
>> >     644fd0687c39b1aa8a68
>> >
>> >     Behavior is consistent between toybox xxd and debian's, but Elliott
>> sent in the
>> >     xxd implementation and I don't use it much, so... Feeding in -c 40
>> and -c 80
>> >     make no difference?
>> >
>> > i think that's actually a bug caused by this:
>> >
>> >   // Plain style is 30 bytes/line, no grouping.
>> >
>> >   if (FLAG(p)) TT.c = TT.g = 30;
>> >
>> > should presumably be
>> >
>> >   // Plain style is 30 bytes/line, no grouping.
>> >   if (FLAG(p)) {
>> >     if (!TT.c) TT.c = 30;
>> >     if (!TT.g) TT.g = 30;
>> >   }
>> >
>> > ?
>>
>> Except we didn't set -g so it would still be set to 30, which is going to
>> stick
>> spaces into the output.
>>
>> And xxd_main() starts with if (!TT.c) TT.c = blah so it would never be
>> zero at
>> that point unless we reorder the code, and then once THAT'S fixed -c 0 is
>> still
>> !TT.c and if I switch that to if (FLAG(c)) to allow c = 0 through (which
>> the
>> range in the optstr is allowing) it's used to cap the length in readall()
>> meaning the first read becomes EOF so no output gets produced.
>>
>> > certainly "real" xxd works for me on macos and debian, both of which
>> have the
>> > same version of xxd:
>> >
>> > ~$ xxd --version
>> > xxd 2022-01-14 by Juergen Weigert et al.
>> > ~$ xxd -p -c 0 -l 40 /dev/urandom
>> >
>> ac160632955aa9d938e60d3533cbcf0febb4decdd12f130e415913ff1fe6e2abcaf7c4a8e980de7a
>>
>> See, this is extra weird: nothing set -g so it should default to 2.
>> Somehow it
>> knows to set itself to... I'm guessing 0. Did -p -c 0 get special cased,
>> or did
>> -p change its default to avoid any breaks even without the -c 0? (Sounds
>> like
>> the latter is more likely, but I tried "yum install xxd" on my fedora 36
>> VM and
>> yum doesn't know what an xxd is.
>>
>>
> (Tangent. Seems like xxd behavior is different everywhere, I should be
> careful and also note my expected output next time I use xxd as a random
> string generator.)
>
> On my debian machine (with the same 2022-01-14 build as Elliott), `man
> xxd` says...
> * -p: Plain text output, '-g' is ignored, '-c' defaults to 30, '-c 0'
> results in one long line (yes '-p -c 0' is a special case for "plain text
> no wraping no grouping")
>

so replace

  if (!TT.c) TT.c = FLAG(i) ? 12 : 16;

  // Plain style is 30 bytes/line, no grouping.
  if (FLAG(p)) TT.c = TT.g = 30;

with

  if (FLAG(p)) {
    // Plain style is 30 bytes/line, no grouping.
    // -p -c 0 is a special case for "one long line".
    if (!TT.c) TT.c = FLAG(c) ? LONG_MAX : 30;
    TT.g = TT.c;
  }
  if (!TT.c) TT.c = FLAG(i) ? 12 : 16;

?

rob: let me know if you've _not_ been distracted by this already and i'll
actually test that and send you a patch this afternoon, since it's friday!


> * doesn't say what would happen with '-c 0' otherwise
>
>
>> > ah, but on another box with 2021-10-22 it's broken. so it looks like
>> "real" xxd
>> > had the same bug and fixed it recently?
>>
>> Eh, seems more like a design decision than a bug. Before -p was
>> wordwrapping the
>> hexdump output and now it isn't. I dunno if it always isn't, or just with
>> -c 0?
>> We didn't set -g and it has a nonzero default value (1, 2, or 4 depending
>> on
>> barometric pressure)...
>>
>> I also note that the man page says -g 0 switches off grouping, but does
>> NOT say
>> that -c 0 switches off columns? In the V1.10 version I have installed, -c
>> 0
>> seems to be a NOP:
>>
>> $ sha1sum < /dev/null | xxd -c 0
>> 00000000: 6461 3339 6133 6565 3565 3662 3462 3064  da39a3ee5e6b4b0d
>> 00000010: 3332 3535 6266 6566 3935 3630 3138 3930  3255bfef95601890
>> 00000020: 6166 6438 3037 3039 2020 2d0a            afd80709  -.
>>
>> Once again: easy to change the behavior, hard to tell what the changed
>> behavior
>> should be. Easiest is to have -p force -g to 0 and -c to huge (stomping
>> whatever
>> else got set in both). I could also teach -c that 0 means infinite (well,
>> sizeof(toybuf) implementation limit which is still bigger than the 256
>> directly
>> settable limit that I have no idea why it's there) if that's actually a
>> thing...?
>>
>> (Grumble grumble no standard and the reference implementation has version
>> skew...)
>>
>> Rob
>>
>
>
> --
>
> Yi-yo Chiang
> Software Engineer
> yochiang at google.com
>
> I support flexible work schedules, and I’m sending this email now because
> it is within the hours I’m working today. Please do not feel obliged to
> reply straight away - I understand that you will reply during the hours you
> work, which may not match mine.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220916/a7bc3ad6/attachment.htm>


More information about the Toybox mailing list