[Toybox] Slow grep

enh enh at google.com
Wed Nov 29 10:41:12 PST 2023


On Wed, Nov 29, 2023 at 10:05 AM Rob Landley <rob at landley.net> wrote:
>
> On 11/29/23 11:05, enh wrote:
> > sorry, looks like we forgot to follow up on this thread! from the
> > internal bug (which was just closed out by a bot for inactivity)...
>
> I know that song well enough to sing fairly complicated counter-harmony.
>
> > """
> > yeah the bucket sort optimization reduces the number of regex compares
> > by around 2 orders of magnitude (there are around 100 printable ascii
> > chars), which is "good enough" for [bug 1] I think.
> >
> > However it doesn't help [bug 2] at all. Since toybox-grep uses the
> > first character of a pattern as the bucket key, and the text file
> > out/soong/hiddenapi/hiddenapi-flags.csv has each line start with an L,
> > we hit the worse case scenario of bucket sort where all entries sort
> > to the same bucket.
> > """
> >
> > if you're curious, that file looks something like this:
> > ```
> > Landroid/Manifest$permission;-><init>()V,public-api,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCEPT_HANDOVER:Ljava/lang/String;,public-api,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCESS_AMBIENT_CONTEXT_EVENT:Ljava/lang/String;,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCESS_AMBIENT_LIGHT_STATS:Ljava/lang/String;,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCESS_BACKGROUND_LOCATION:Ljava/lang/String;,public-api,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCESS_BLOBS_ACROSS_USERS:Ljava/lang/String;,public-api,sdk,system-api,test-api
> > Landroid/Manifest$permission;->ACCESS_BROADCAST_RADIO:Ljava/lang/String;,sdk,system-api,test-api
> > ```
> > and is indeed quite large:
> > ```
> > ~/aosp-main-with-phones/out/soong/hiddenapi$ wc -l hiddenapi-flags.csv
> > 606870 hiddenapi-flags.csv
> > ```
> > (full file available on request, if you want it!)
>
> Um, yes please. Always best to test with real-world data. Do you have a file
> you'd like to run it ON quickly as well?

from the bug, it looks like this was the command:
```
time grep -F -f <(cut -f1 -d, filtered-stub-flags.csv
hiddenapi-flags.csv >/dev/null
```
i'll send you gzip'ed copies of the two files, but since they're 5MiB
compressed, i'll not bother the list with that.

> I dunno what bug[2] is, and the "internal bug" isn't something I have a link
> for. Thunderbird attached this as a reply to Yi-Yo Chiang's message on October
> 3, 2022 which doesn't have bug 1/2, and going back to the first message in the
> thread:
>
> http://lists.landley.net/pipermail/toybox-landley.net/2022-September/029242.html
>
> That hasn't got a [2] either. I'm guessing it's "test with a very long list of
> this sort of thing as the -F input file", and yeah, that sucks.

yeah, sorry, i was trying to _not_ waste your time with internal bug
links that won't work for you. that obviously didn't go as planned!

> Hmmm... looks
> like the bucket sort needs to nest, doing a tree thing beyond a certain number
> of entries sharing a large common prefix? Or just peel off the prefix and always
> treat that as fixed, and sort the patterns by prefix so we can binary search.
> (If first byte of each pattern is prefix length, if it's longer than 255 just
> treat it as 255 and let the regex engine handle the rest...) Binary search has
> to find FIRST entry with matching prefix and then traverse, which is gonna be
> fun with variable shared prefix lengths. Some design work there...
>
> I gotta go re-read commits a7e49c3c7860 and the fixes 193009855266 b26689f95065
> fd8ef8cac003 5ed3c33417d3 to re-familiarize myself with what we were previously
> optimizing...
>
> For the new one, do you care about fixed patterns, regexes, or both?

afaik the only cases of people doing anything like this are all fgrep.
(iirc that's what we saw in BSD grep too --- it had these kinds of
optimizations for the fgrep case, but not the regex case?)

> > and even if you dropped the java internal class syntax, you'd still
> > have most of the patterns start with 'a' for 'android' and the rest
> > start with 'j' for 'java' :-)
>
> For fixed patterns I can do a binary search. For regexes, the concept of a
> binary search gets... squishy. Peeling off leading fixed runs from patterns and
> doing a fixed-plus-regex thing at least kicks the can down the road to the NEXT
> use case that's too slow.
>
> Um, does pattern order matter ala -o and similar? I guess if you have two
> patterns that can match here you take the longest one, but you only care when
> you DO have -o and I doubt that's what we're optimizing for...? But I should
> make it work right...
>
> Rob


More information about the Toybox mailing list