[Toybox] Embedded NUL bytes in grep/sed, or "strings are hard".

Owen Shepherd owen.shepherd at e43.eu
Tue Sep 30 12:02:54 PDT 2014


Rob Landley wrote:
> In theory I can implement my own get_line() on top of FILE * using fgetc,
> but this is again looping over single bytes (because with ungetc only one
> pushback is guaranteed). A function call is cheaper than
> a system call, but still not exactly ideal. Unfortunately, I can't ask stdio
> "how many bytes of readahead are in your internal buffer" because it wants to
> hide those details. (Under strace, most actual fgetc() loops I actually
> watched did the darn one syscall per byte thing anyway.)
Is the file/stdin appropriately buffered? (i.e. is your implementation 
being conservative and making stdin _IONBF for no good reason?)

More concretely: what libc was this tested with? If uclibc, I'm inclined 
to believe uclibc is a pile of crap. If musl, WTF.

glibc gets this right, FWIW:
oshepherd at Shinji:~$ cat testbuf.c
#include <stdio.h>

int main()
{
     int c;
     while((c = fgetc(stdin)) != EOF)
         fputc(c, stdout);
     return 1;
}
oshepherd at Shinji:~$ strace ./testbuf < testbuf.c
execve("./testbuf", ["./testbuf"], [/* 21 vars */]) = 0
/* dynamic linker noise excised */
read(0, "#include <stdio.h>\n\nint main()\n{"..., 4096) = 123
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f93b7e9b000
write(1, "#include <stdio.h>\n", 19#include <stdio.h>
)    = 19
write(1, "\n", 1
)                       = 1
write(1, "int main()\n", 11int main()
)            = 11
write(1, "{\n", 2{
)                      = 2
write(1, "    int c;\n", 11    int c;
)            = 11
write(1, "    while((c = fgetc(stdin)) != "..., 37    while((c = fgetc(stdin)) != EOF)
) = 37
write(1, "        fputc(c, stdout);\n", 26        fputc(c, stdout);
) = 26
write(1, "    return 1;\n", 14    return 1;
)         = 14
write(1, "}\n", 2}
)                      = 2
read(0, "", 4096)                       = 0
exit_group(1)                           = ?

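For best performance, make sure that stdin is fully buffered and then:

 1. Call flockfile(stdin) (and funlockfile(stdin) when you are done),
    because POSIX only guarantees the *_unlocked functions are safe
    while the calling thread holds the stream lock
 2. Use getc_unlocked, which may be a macro, and should be the fastest
    way to grab a character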
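
To make the two steps above concrete, here is a rough sketch of a line
reader built on them. The name read_line_unlocked and the doubling growth
policy are my own inventions for illustration, not anything in toybox. It
returns the length explicitly, so embedded NUL bytes survive:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read one line (including the trailing '\n', if
 * any) from fp into *buf, growing *buf with realloc() as needed.
 * Returns the number of bytes stored, or -1 if nothing was read (EOF)
 * or if allocation failed. The line is NOT NUL-terminated; the caller
 * must use the returned length, which is what lets embedded NUL bytes
 * pass through intact.
 */
ssize_t read_line_unlocked(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;
    int c;

    flockfile(fp);                      /* step 1: take the stream lock */
    while ((c = getc_unlocked(fp)) != EOF) {   /* step 2: cheap reads */
        if (len + 1 > *cap) {
            size_t ncap = *cap ? *cap * 2 : 128;
            char *nbuf = realloc(*buf, ncap);

            if (!nbuf) {
                funlockfile(fp);
                return -1;
            }
            *buf = nbuf;
            *cap = ncap;
        }
        (*buf)[len++] = (char)c;
        if (c == '\n')
            break;
    }
    funlockfile(fp);                    /* always release the lock */

    return len ? (ssize_t)len : -1;
}
```

Start with buf = NULL and cap = 0, call it in a loop, and free(buf) at the
end. Taking the lock once per line rather than once per byte is the whole
point; whether getc_unlocked is actually a macro depends on the libc.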

The cost of all those function calls should be much less than the cost 
of a system call per line, especially if you give stdio a big buffer to 
work with. Whatever you do, give stdio a big buffer.
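
A minimal sketch of what "give stdio a big buffer" could look like. The
helper name and the 64 KiB size are arbitrary choices of mine. I pass an
explicit buffer because the C standard only says the size argument "may"
be honoured when the buffer pointer is NULL:

```c
#include <stdio.h>

/* Assumed size for illustration only; tune to taste. */
#define BIG_BUF_SIZE (64 * 1024)

/* Static storage, so the buffer outlives any use of the stream. */
static char big_buf[BIG_BUF_SIZE];

/*
 * Hypothetical helper: put fp into fully buffered mode (_IOFBF) using
 * big_buf. setvbuf() must be called before any other operation on the
 * stream. Returns 0 on success, nonzero on failure. Only one stream at
 * a time may use big_buf.
 */
int give_stdio_big_buffer(FILE *fp)
{
    return setvbuf(fp, big_buf, _IOFBF, sizeof big_buf);
}
```

Call give_stdio_big_buffer(stdin) first thing in main(), before anything
has read from stdin. After that each read() can pull in up to 64 KiB, and
the per-byte cost is just the getc call.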
