[Toybox] Slow grep

Rob Landley rob at landley.net
Wed Nov 29 10:10:56 PST 2023


On 11/29/23 11:05, enh wrote:
> sorry, looks like we forgot to follow up on this thread! from the
> internal bug (which was just closed out by a bot for inactivity)...

I know that song well enough to sing fairly complicated counter-harmony.

> """
> yeah the bucket sort optimization reduces the number of regex compares
> by around 2 orders of magnitude (there are around 100 printable ascii
> chars), which is "good enough" for [bug 1] I think.
> 
> However it doesn't help [bug 2] at all. Since toybox-grep uses the
> first character of a pattern as the bucket key, and the text file
> out/soong/hiddenapi/hiddenapi-flags.csv has each line start with an L,
> we hit the worse case scenario of bucket sort where all entries sort
> to the same bucket.
> """
> 
> if you're curious, that file looks something like this:
> ```
> Landroid/Manifest$permission;-><init>()V,public-api,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCEPT_HANDOVER:Ljava/lang/String;,public-api,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCESS_AMBIENT_CONTEXT_EVENT:Ljava/lang/String;,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCESS_AMBIENT_LIGHT_STATS:Ljava/lang/String;,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCESS_BACKGROUND_LOCATION:Ljava/lang/String;,public-api,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCESS_BLOBS_ACROSS_USERS:Ljava/lang/String;,public-api,sdk,system-api,test-api
> Landroid/Manifest$permission;->ACCESS_BROADCAST_RADIO:Ljava/lang/String;,sdk,system-api,test-api
> ```
> and is indeed quite large:
> ```
> ~/aosp-main-with-phones/out/soong/hiddenapi$ wc -l hiddenapi-flags.csv
> 606870 hiddenapi-flags.csv
> ```
> (full file available on request, if you want it!)

Um, yes please. Always best to test with real-world data. Do you have a file
you'd like to run it ON quickly as well?

I dunno what bug[2] is, and the "internal bug" isn't something I have a link
for. Thunderbird attached this as a reply to Yi-Yo Chiang's message on October
3, 2022 which doesn't have bug 1/2, and going back to the first message in the
thread:

http://lists.landley.net/pipermail/toybox-landley.net/2022-September/029242.html

That hasn't got a [2] either. I'm guessing it's "test with a very long list of
this sort of thing as the -F input file", and yeah, that sucks. Hmmm... looks
like the bucket sort needs to nest, doing a tree thing beyond a certain number
of entries sharing a large common prefix? Or just peel off the prefix and always
treat that as fixed, and sort the patterns by prefix so we can binary search.
(If first byte of each pattern is prefix length, if it's longer than 255 just
treat it as 255 and let the regex engine handle the rest...) Binary search has
to find FIRST entry with matching prefix and then traverse, which is gonna be
fun with variable shared prefix lengths. Some design work there...

I gotta go re-read commits a7e49c3c7860 and the fixes 193009855266 b26689f95065
fd8ef8cac003 5ed3c33417d3 to re-familiarize myself with what we were previously
optimizing...

For the new one, do you care about fixed patterns, regexes, or both?

> and even if you dropped the java internal class syntax, you'd still
> have most of the patterns start with 'a' for 'android' and the rest
> start with 'j' for 'java' :-)

For fixed patterns I can do a binary search. For regexes, the concept of a
binary search gets... squishy. Peeling off leading fixed runs from patterns and
doing a fixed-plus-regex thing at least kicks the can down the road to the NEXT
use case that's too slow.

Um, does pattern order matter ala -o and similar? I guess if you have two
patterns that can match here you take the longest one, but you only care when
you DO have -o and I doubt that's what we're optimizing for...? But I should
make it work right...

Rob


More information about the Toybox mailing list