[Toybox] Slow grep

enh enh at google.com
Wed Nov 29 09:05:03 PST 2023


sorry, looks like we forgot to follow up on this thread! from the
internal bug (which was just closed out by a bot for inactivity)...
"""
yeah the bucket sort optimization reduces the number of regex compares
by around 2 orders of magnitude (there are around 100 printable ascii
chars), which is "good enough" for [bug 1] I think.

However it doesn't help [bug 2] at all. Since toybox-grep uses the
first character of a pattern as the bucket key, and the text file
out/soong/hiddenapi/hiddenapi-flags.csv has each line start with an L,
we hit the worse case scenario of bucket sort where all entries sort
to the same bucket.
"""

if you're curious, that file looks something like this:
```
Landroid/Manifest$permission;-><init>()V,public-api,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCEPT_HANDOVER:Ljava/lang/String;,public-api,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCESS_AMBIENT_CONTEXT_EVENT:Ljava/lang/String;,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCESS_AMBIENT_LIGHT_STATS:Ljava/lang/String;,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCESS_BACKGROUND_LOCATION:Ljava/lang/String;,public-api,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCESS_BLOBS_ACROSS_USERS:Ljava/lang/String;,public-api,sdk,system-api,test-api
Landroid/Manifest$permission;->ACCESS_BROADCAST_RADIO:Ljava/lang/String;,sdk,system-api,test-api
```
and is indeed quite large:
```
~/aosp-main-with-phones/out/soong/hiddenapi$ wc -l hiddenapi-flags.csv
606870 hiddenapi-flags.csv
```
(full file available on request, if you want it!)

and even if you dropped the java internal class syntax, you'd still
have most of the patterns start with 'a' for 'android' and the rest
start with 'j' for 'java' :-)

On Mon, Oct 3, 2022 at 3:03 AM Rob Landley <rob at landley.net> wrote:
>
> On 10/3/22 03:43, Yi-yo Chiang wrote:
> > On Thu, Sep 29, 2022 at 1:20 PM Rob Landley <rob at landley.net
> > <mailto:rob at landley.net>> wrote:
> >
> >     All '^' and '$' do is say the zero length match occurs just once at the start or
> >     end of every line, but the -o logic discards all zero length matches and the
> >     non-o logic just cares that there IS a match. UNLESS you're producing colored
> >     output, but the colored output mostly piggybacks on -o. There is a sort of
> >     terrible corner case though, which oddly enough makes this corner case visible!
> >
> >     $ echo potato | grep --color=always '' | hd
> >     00000000  70 6f 74 61 74 6f 0a                              |potato.|
> >     00000007
> >     $ echo potato | toybox grep --color=always '' | hd
> >     00000000  1b 5b 6d 1b 5b 31 3b 33  31 6d 1b 5b 6d 70 6f 74  |.[m.[1;31m.[mpot|
> >     00000010  61 74 6f 0a                                       |ato.|
> >     00000014
> >     $ echo potato | toybox grep --color=always '$' | hd
> >     00000000  1b 5b 6d 70 6f 74 61 74  6f 1b 5b 31 3b 33 31 6d  |.[mpotato.[1;31m|
> >     00000010  1b 5b 6d 0a                                       |.[m.|
> >     00000014
> >
> >     Yeah, I have a todo item to try to optimize the color escape generation for
> >     toybox but last time I sat down and looked at it I just didn't have the spoons.
> >
> > I have some optimizations related to color output, not sure if these are what
> > you had in mind?
>
> NI was just talking about making it not emit unnecessary color escapes that do
> nothing but cancel each other out.
>
> > * If not showing color (no --color) or not showing matched part (-v) and (-o,
> > -w, -x) are not given, then call regcomp() with REG_NOSUB. (could potentially
> > improve the compiled regex)
> > * If regex pattern was compiled with REG_NOSUB (we are not printing/highlighting
> > matched part), then we can stop pattern matching as soon as we found first hit
> > (no need to loop through all patterns to find longest match nor all matches)
> > (break this loop early
> > https://github.com/landley/toybox/blob/b17dc8e111dd408db04ea7ae70c410fd3054a751/toys/posix/grep.c#L218
>
> *shrug* Sure, sounds like a good idea.
>
> > I can send you these patches after I rebase my local tree on top of the recent
> > bucket sort optimizations.
> >
> >     > that a simplified case based on the real-world build break caused by
> >     ...
> >     > merging zero jar files into one instead of all the jar files :-)
> >
> >     Acknowledged. Try commit 193009855266?
> >
> >
> > Still had problems when matching Fixed patterns:
> >
> > $ echo '\.zip' | ./old-toybox grep -F '\.zip'
> > \.zip
> > $ echo '\.zip' | ./toybox grep -F '\.zip'
> > $
> >
> > I think what's happening is the fixed string pattern '\.zip' gets put into the
> > '.' bucket (should be '\' bucket instead?), thus '\.zip' never matches anything.
>
> Sigh, you're right. Try commit b26689f95065
>
> Rob


More information about the Toybox mailing list