[Toybox] awk seen in the wild

Andy Chu andychup at gmail.com
Thu Jul 21 01:28:12 PDT 2016


> Kernighan Awk has its own regex implementation "b.c" in 958 lines, and
> there is an argument to keep it.  It uses the Thompson linear-time
> NFA/DFA algorithm rather than exponential backtracking.  See the note
> here:

Never mind about this tangent... I *think* GNU libc's regex engine
actually uses the linear-time algorithm too, with a possible exception
for backreferences.  I was in the middle of some research on that but
didn't finish (musl libc, for comparison, uses a fork of the TRE regex
engine, etc.).
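
One way to check this empirically, rather than by reading the source, is
to time the classic pathological pattern (a?){n}a{n} against a string of
n 'a's through the POSIX <regex.h> interface.  What follows is only a
rough sketch: the helper name pathological_match_seconds is made up for
illustration, and it measures whatever libc the program is linked
against, so it says nothing about the private copies discussed below.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not taken from any of the packages in this thread).
 * Matches ^(a?){n}a{n}$ against a string of n 'a's using the C library's
 * POSIX regex engine.  Returns the CPU seconds spent inside regexec(),
 * or -1.0 if the pattern does not compile.  *matched is set to 1 when
 * the string matched and 0 otherwise. */
double pathological_match_seconds(int n, int *matched)
{
    char pattern[64];
    char subject[128];
    regex_t re;

    /* Keep n small enough for the fixed-size buffers above. */
    assert(n > 0 && n < (int)sizeof(subject));

    snprintf(pattern, sizeof(pattern), "^(a?){%d}a{%d}$", n, n);
    memset(subject, 'a', (size_t)n);
    subject[n] = '\0';

    /* REG_NOSUB: only a yes/no answer is needed, not submatch offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1.0;

    clock_t start = clock();
    int rc = regexec(&re, subject, 0, NULL, 0);
    clock_t end = clock();

    regfree(&re);
    *matched = (rc == 0);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling this from a small main() for n = 10, 15, 20, 25 and printing the
results should show the difference: a backtracking matcher roughly doubles
its time with each increment of n, while an automaton-based one stays
close to flat.  The string always matches, so a result of "no match"
would point at a bug or an interval limit rather than at the algorithm.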

But oddly, GNU grep, gawk, sed, and coreutils each ship their own copy
of the GNU libc regex engine, and judging by the line counts the copies
have drifted slightly apart.  That is just annoying.

$ wc -l gawk-*/reg* */lib/reg*.[ch] | sort -n
     81 coreutils-8.22/lib/regex.c
     81 grep-2.24/lib/regex.c
     81 sed-4.2.2/lib/regex.c
     85 gawk-4.1.3/regex.c
    591 gawk-4.1.3/regex.h
    664 grep-2.24/lib/regex.h
    667 coreutils-8.22/lib/regex.h
    668 sed-4.2.2/lib/regex.h
    834 gawk-4.1.3/regex_internal.h
    868 sed-4.2.2/lib/regex_internal.h
    910 coreutils-8.22/lib/regex_internal.h
    912 grep-2.24/lib/regex_internal.h
   1742 grep-2.24/lib/regex_internal.c
   1744 sed-4.2.2/lib/regex_internal.c
   1746 coreutils-8.22/lib/regex_internal.c
   1759 gawk-4.1.3/regex_internal.c
   3927 coreutils-8.22/lib/regcomp.c
   3941 sed-4.2.2/lib/regcomp.c
   3958 gawk-4.1.3/regcomp.c
   3962 grep-2.24/lib/regcomp.c
   4391 gawk-4.1.3/regexec.c
   4412 grep-2.24/lib/regexec.c
   4418 coreutils-8.22/lib/regexec.c
   4421 sed-4.2.2/lib/regexec.c

Andy

