[Toybox] New toy: grep

Rob Landley rob at landley.net
Mon Feb 27 21:39:22 PST 2012


On 02/27/2012 02:45 PM, Tim Elliott wrote:
> On Mon, Feb 27, 2012 at 12:35 PM, Andre Renaud <andre at bluewatersys.com> wrote:
>> Regarding the greps over binaries, and arbitrary length buffers. I'm
>> curious what kind of implementation you'd do there to avoid having
>> issues with expressions that sit on block boundaries, or regular
>> expressions that have a possible infinite length match, such as 'a.*b'.
>> Is it realistic to expect the entire file to be in memory for such a
>> regexp? I suppose mmap could be used, but isn't that a bit heavy-handed?
> 
> This post has some pointers on non-regex string searching and mmap in grep:
> http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
> 
> Since toybox prioritizes simplicity over performance, it may or may
> not end up being useful.

I'm pretty happy to call this libc's problem, at least until I have to
implement a regex engine. :)
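
To make "libc's problem" concrete, here's a minimal sketch of the
line-at-a-time approach, assuming POSIX getline(), regcomp() and
regexec(). The name count_matching_lines() is made up for illustration,
it isn't toybox code. Because getline() grows its buffer to hold the
whole line, something like 'a.*b' can never straddle a block boundary;
the worst case is one very long line in memory. Note that regexec()
stops at the first NUL byte, so this sketch doesn't handle binaries.

```c
#define _POSIX_C_SOURCE 200809L
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not from toybox): count the lines in fp that
 * match the POSIX extended regex in pattern. Returns -1 if the pattern
 * does not compile. */
long count_matching_lines(FILE *fp, const char *pattern)
{
  regex_t re;
  char *line = NULL;
  size_t cap = 0;
  ssize_t len;
  long hits = 0;

  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB)) return -1;

  /* getline() reallocates as needed, so the regex only ever sees one
   * complete line at a time. */
  while ((len = getline(&line, &cap, fp)) > 0) {
    /* Strip the trailing newline so '$' anchors behave as expected. */
    if (line[len - 1] == '\n') line[len - 1] = 0;
    if (!regexec(&re, line, 0, NULL, 0)) hits++;
  }

  free(line);
  regfree(&re);

  return hits;
}
```

Calling count_matching_lines(stdin, "a.*b") would behave roughly like
"grep -cE 'a.*b'" on text input.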

Any sane modern processor is probably going to be I/O limited. The
common case is actually page sized or smaller copies, and for those it's
all cache local. Linus Torvalds had a good post about this a decade or
so back.  Let's see... maybe this?

http://lkml.indiana.edu/hypermail/linux/kernel/0004.0/0728.html
http://lkml.indiana.edu/hypermail/linux/kernel/0004.0/0775.html

I vaguely recall something about page sized copies being a sweet spot...
I also remember, over and over through the years, optimizing for one
generation of processor being a waste on another generation. Hardware
engineers have spent the past 40 years of microchip development
optimizing for code that does the obvious thing.

The skip-forward lookup table thing is nice (except for the part where
your lookup table itself might eat a decent chunk of L1 cache if
you're not careful; I very vaguely remember bumping into this years
ago).  Worrying about line boundaries only retroactively, once something
has matched, is also nice.  Maybe those two would be worth doing someday.
But being able to use standard get_line() and regex() code instead of
having to write something just for grep is a big plus too.
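
For reference, a rough sketch of the skip-forward table idea for fixed
strings. This uses Horspool's simplification of Boyer-Moore, which is my
choice for brevity and not necessarily what the linked post or any real
grep does. The name skip_search() and the 255 byte needle limit are
assumptions for illustration. The limit lets each table entry fit in one
byte, so the whole table is 256 bytes; a table of size_t entries would be
2k on a 64 bit system, which is where the L1 cache worry comes in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find needle (nlen bytes, 1 to 255) in hay (hlen
 * bytes) using a Horspool skip table. Returns a pointer to the first
 * match, or NULL if there is none or nlen is out of range. */
const char *skip_search(const char *hay, size_t hlen,
                        const char *needle, size_t nlen)
{
  unsigned char skip[256];  /* one byte per entry: 256 bytes total */
  size_t i, pos = 0;

  if (!nlen || nlen > 255 || nlen > hlen) return NULL;

  /* A byte that never occurs in needle lets us jump a full needle. */
  memset(skip, (int)nlen, sizeof(skip));

  /* Otherwise jump far enough to line up the byte's last occurrence
   * (ignoring the needle's final position). */
  for (i = 0; i + 1 < nlen; i++)
    skip[(unsigned char)needle[i]] = (unsigned char)(nlen - 1 - i);

  while (pos <= hlen - nlen) {
    if (!memcmp(hay + pos, needle, nlen)) return hay + pos;
    /* Skip based on the haystack byte under the needle's last position. */
    pos += skip[(unsigned char)hay[pos + nlen - 1]];
  }

  return NULL;
}
```

Since it takes explicit lengths, this works on buffers with embedded NUL
bytes, but it only handles fixed strings, so regex patterns would still
go through regexec().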

Rob
