[Toybox] Embedded NUL bytes in grep/sed, or "strings are hard".

Rob Landley rob at landley.net
Tue Sep 30 06:22:25 PDT 2014


The recent request to make grep work with embedded NUL bytes (which impacts
the design of sed and probably other stuff) is... tricky. Because line
buffering is a nightmare.

The problem is that while POSIX-2008 getline() will read an entire line up to
the next newline, embedded NUL bytes and all, it _won't_ tell you the length
of the line it read. The size_t it hands back is the length of the allocated
buffer, which gets rounded up to who knows what, and the extra bytes past the
end of the line aren't necessarily zeroed.
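
For context, the calling sequence in question is something like this
(untested sketch, usual stdio includes assumed):

  char *line = 0;
  size_t alloc = 0;

  if (getline(&line, &alloc, stdin) != -1) {
    // For input "ab\0cd\n": strlen(line) stops at the embedded NUL and
    // says 2, and 'alloc' is just the malloc() bucket size that got
    // rounded up, so neither recovers the real six-byte line length.
  }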

In theory we could deal with this by scanning forward until we hit a newline,
of which there should be exactly one... except the buffer isn't guaranteed to
contain a newline, because the last line of the file won't necessarily end
with one. So "this line didn't end with a newline yet" doesn't guarantee
you'll find one after the NUL byte, although the spurious garbage past the
real data may contain one anyway.
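
I.e. continuing the sketch above (assuming getline() returned a line), which
is exactly where it falls apart:

  size_t seen = strlen(line);                        // stops at first NUL
  char *nl = memchr(line + seen, '\n', alloc - seen);

  // If nl landed in the uninitialized tail past the real line, it's
  // spurious garbage; if it's NULL, maybe this was an unterminated last
  // line... or maybe not. The two cases look identical.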

In theory you could read one more line and check the return code to see if
you hit end of file, but even if you do that you still can't handle a NUL
byte in the last line, because you don't know where that line ends. What do
you do with a file that ends with three NUL bytes? Or with a newline in the
trailing garbage of a line that didn't otherwise have one?

The reason to use getline() instead of lib/lib.c get_line() is speed: doing
a syscall for each byte is really slow, but it's the only way to control the
input. Reading multiple bytes at a time doesn't let you put the extras _back_
into a pipe. So if you have a shell command that should eat a specific amount
of stdin and leave the rest for future commands, you can't return any
readahead that turned out not to be for you.
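
The byte-at-a-time version looks roughly like this (untested sketch with a
hypothetical name, not the actual lib/ code):

  #include <unistd.h>

  // One syscall per byte: slow, but never consumes input past the
  // newline, so the rest of stdin stays put for whoever reads next.
  // Returns the byte count, so embedded NULs can't lose the length.
  ssize_t line_bytewise(int fd, char *buf, size_t max)
  {
    size_t i = 0;

    while (i < max-1 && read(fd, buf+i, 1) == 1)
      if (buf[i++] == '\n') break;
    buf[i] = 0;

    return i ? (ssize_t)i : -1;
  }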

If we assume that terminal "cooked" mode will take care of that for us
(by sizing reads so each short read returns a line at a time), we still have
the problem of retaining extra data between lines when "command < file" is
giving us data. So we need an overflow buffer attached to the input stream,
and the obvious thing to do there is use FILE * (which exists to attach
an overflow buffer to an input stream).
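
Meaning something like (sketch):

  FILE *fp = fdopen(fd, "r");                // stdio owns the readahead now
  if (fp) setvbuf(fp, NULL, _IOFBF, 4096);   // the overflow buffer, explicitly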

Which gets us back to why we were using getline() in the first place,
and the problems with that.

This is why I was leaning on scanf so hard: its %n conversion tells us how
many bytes of input it actually _read_. Except the internationalization
idiots broke the existing semantics, so if you aren't in the C locale it
tells you the number of CHARACTERS read, not the number of BYTES. And grep
and sed are exactly the kind of things that care about locale (case
insensitivity).
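
The trick in question, reusing fp from above (sketch; note the newline
itself stays in the stream, and an empty line is a matching failure):

  char buf[4096];
  int used = 0;

  // In the C locale %n is a byte count; the complaint above is that
  // with a multibyte locale set it can become a CHARACTER count instead.
  if (fscanf(fp, "%4095[^\n]%n", buf, &used) == 1)
    printf("line was %d bytes... in the C locale\n", used);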

In theory I can implement my own get_line() on top of FILE * using fgetc(),
but that's again looping over single bytes (because ungetc() only guarantees
one byte of pushback). A function call is cheaper than a system call, but
still not exactly ideal. Unfortunately, I can't ask stdio "how many bytes of
readahead are in your internal buffer" because it wants to hide those
details. (Under strace, most fgetc() loops I actually watched did the darn
one-syscall-per-byte thing anyway.)
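
So the fgetc() version would be something along these lines (untested
sketch, hypothetical name):

  #include <stdio.h>
  #include <sys/types.h>

  // Keep our own byte count so embedded NULs can't hide the length.
  // One function call per byte: cheaper than a syscall, still not free.
  ssize_t fgetline(FILE *fp, char *buf, size_t max)
  {
    size_t i = 0;
    int c = EOF;

    while (i < max-1 && (c = fgetc(fp)) != EOF) {
      buf[i++] = c;
      if (c == '\n') break;
    }
    buf[i] = 0;

    return (i || c != EOF) ? (ssize_t)i : -1;
  }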

I think ftell() might work, since it should report the position of the next
byte it would deliver rather than how far the underlying descriptor has read
ahead. But I can't guarantee it isn't implemented as a call to lseek(), which
doesn't work on pipes (-ESPIPE).
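
I.e. (sketch; fine on a seekable file, dicey on a pipe):

  long before = ftell(fp);
  // ... read one line however ...
  long after = ftell(fp);

  // Difference is the line length in bytes, embedded NULs included...
  // IF both calls worked. On a pipe they can fail with ESPIPE when the
  // implementation bottoms out in lseek().
  long len = (before != -1 && after != -1) ? after - before : -1;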

If I know this file will _only_ have lines in it, and will be read to the
end (or at least that we don't care about data after the last line of
interest), then I can implement my own buffered "line = get_line(fd, buf,
buflen);" that stores the leftover data between calls. Which is probably the
right thing to do, but not something to hold the pending release for...
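
Rough shape of the thing (untested sketch; hypothetical struct, and
returning the length instead of the pointer since the length is the whole
point):

  #include <unistd.h>

  struct linebuf {
    char buf[4096];
    size_t pos, len;            // consumed so far vs. amount filled
  };

  // Read big chunks, hand out one line per call (byte count returned,
  // so embedded NULs survive), and keep the leftover for next time.
  // Only safe because nothing downstream can reclaim our readahead.
  ssize_t get_line(int fd, struct linebuf *lb, char *out, size_t max)
  {
    size_t i = 0;

    while (i < max-1) {
      if (lb->pos == lb->len) {
        ssize_t r = read(fd, lb->buf, sizeof(lb->buf));

        if (r <= 0) break;      // EOF or error: return what we've got
        lb->pos = 0;
        lb->len = r;
      }
      out[i++] = lb->buf[lb->pos++];
      if (out[i-1] == '\n') break;
    }
    out[i] = 0;

    return i ? (ssize_t)i : -1;
  }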

Speaking of the release: October 1 is coming up and that's a logical time to
finally have the LONG-DELAYED thing, which I'm thinking of calling 0.5.0 just
because. (Working on release notes now.)

Rob

