[Toybox] Embedded NUL bytes in grep/sed, or "strings are hard".

Rob Landley rob at landley.net
Sun Oct 5 10:46:35 PDT 2014


On 09/30/14 14:02, Owen Shepherd wrote:
> Rob Landley wrote:
>> In theory I can implement my own get_line() on top of FILE * using fgetc,
>> but this is again looping over single bytes (because with ungetc only one
>> pushback is guaranteed). A function call is cheaper than
>> a system call, but still not exactly ideal. Unfortunately, I can't ask stdio
>> "how many bytes of readahead are in your internal buffer" because it wants to
>> hide those details. (Under strace, most fgetc() loops I actually
>> watched did the darn one syscall per byte thing anyway.)
>
> Is the file/stdin appropriately buffered? (i.e. is your implementation
> being conservative and making stdin _IONBF for no good reason?)

I very much want this to be libc's problem, not mine. That's the main
reason to use FILE *.
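
For concreteness, the "get_line() on top of FILE *" idea quoted above might
look something like the sketch below. This is only an illustration, not
toybox's actual code: the name get_line_len() and its signature are made up
here. It returns an explicit length so embedded NUL bytes survive, and it
leans entirely on whatever buffering libc chose for the stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not toybox's real code): read one line from fp,
 * including the trailing '\n' if present. Returns a malloc()ed buffer and
 * stores the byte count in *len, so embedded NUL bytes are preserved.
 * Returns NULL at EOF with nothing read, or on allocation failure. */
char *get_line_len(FILE *fp, size_t *len)
{
    char *buf = NULL, *tmp;
    size_t used = 0, cap = 0;
    int c;

    while ((c = getc(fp)) != EOF) {
        if (used + 1 >= cap) {  /* room for this byte plus a terminator */
            cap = cap ? cap * 2 : 128;
            tmp = realloc(buf, cap);
            if (!tmp) {
                free(buf);
                *len = 0;
                return NULL;
            }
            buf = tmp;
        }
        buf[used++] = c;
        if (c == '\n') break;
    }
    if (buf) buf[used] = 0;     /* convenience terminator, not counted */
    *len = used;
    return buf;
}
```

The caller has to use *len rather than strlen() on the result, otherwise the
first NUL byte truncates the line again.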

> More concretely: what libc was this tested with? If uclibc, I'm inclined
> to believe uclibc is a pile of crap. If musl, WTF.

I believe I looked at both uClibc and glibc, but it was a while ago. (As
in several years ago.)

> glibc gets this right, FWIW:
> oshepherd@Shinji:~$ cat testbuf.c
> #include <stdio.h>
> 
> int main()
> {
>     int c;
>     while((c = fgetc(stdin)) != EOF)
>         fputc(c, stdout);
>     return 1;
> }
> oshepherd@Shinji:~$ strace ./testbuf < testbuf.c
> execve("./testbuf", ["./testbuf"], [/* 21 vars */]) = 0
> /* dynamic linker noise excised */
> read(0, "#include <stdio.h>\n\nint main()\n{"..., 4096) = 123

When I looked, output was line buffered (newlines flushed the buffer),
but input was being read a single byte at a time. Good to see that's
changed, I guess...

> For best performance, make sure that stdin is fully buffered and then
> 
>  1. flockfile(stdin), because POSIX says to do so
>  2. Use getc_unlocked, which may be a macro, and should be the fastest
>     way to grab a character

See "want this to be libc's problem".
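
For readers who do want to go that route, the two steps suggested in the
quote can be sketched as below. flockfile(), funlockfile(), getc_unlocked()
and putc_unlocked() are standard POSIX calls; the wrapper name
copy_unlocked() is hypothetical and only here for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical helper: copy in to out one byte at a time, taking each
 * stream's lock once up front instead of once per byte.
 * Returns the number of bytes copied. */
long copy_unlocked(FILE *in, FILE *out)
{
    long count = 0;
    int c;

    flockfile(in);
    flockfile(out);
    while ((c = getc_unlocked(in)) != EOF) {
        if (putc_unlocked(c, out) == EOF) break;  /* write error */
        count++;
    }
    funlockfile(out);
    funlockfile(in);
    return count;
}
```

The bytes still go through stdio's buffer, so the number of read() calls is
the same as with plain fgetc(). What this saves is the per-call locking.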

Getting the block size right is 99% of optimizing this sort of thing.
The rest is details. (Maddog had a marvelous talk about this at
LinuxWorld Expo in 2001. I have a tape of it somewhere, but alas it's on
cassette and I no longer have a player for that.)

> The cost of all those function calls should be much less than the cost
> of a system call per line, especially if you give stdio a big buffer to
> work with. Whatever you do, give stdio a big buffer

I don't want to micromanage stdio's buffer size. That's libc's job. Glad
to hear it's doing a better job of it than it was circa 2008.
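
For reference, the "give stdio a big buffer" suggestion quoted above comes
down to one setvbuf() call made before any other I/O on the stream. A
minimal sketch, assuming a 64k buffer is "big" (that size, and the names
use_big_buffer() and big_buffer, are picked here purely for illustration):

```c
#include <stdio.h>

/* Illustrative size only. The buffer is static because it has to stay
 * valid for as long as the stream is open, and it can only back one
 * stream at a time. */
static char big_buffer[1 << 16];

/* Hypothetical helper: request full buffering using big_buffer. Must be
 * called before any other operation on fp. Returns 0 on success, nonzero
 * if libc could not honor the request. */
int use_big_buffer(FILE *fp)
{
    return setvbuf(fp, big_buffer, _IOFBF, sizeof(big_buffer));
}
```

Passing an explicit buffer matters here: with a NULL buffer, a libc is
allowed to treat the size argument as nothing more than a hint.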

Rob
