[Toybox] toybox - added cmp

Rob Landley rob at landley.net
Mon Feb 13 20:34:46 PST 2012


On 02/09/2012 11:00 AM, Frank Bergmann wrote:
> Hi.
> 
> On Thu, Feb 09, 2012 at 05:46:50AM -0600, Rob Landley wrote:
>> When get_line() reads data, it can't know where the terminator character
>> occurs ahead of time.  So either it makes sure not to read past the end
>> of the line (by reading one character at a time), or it has to store the
>> beginning of the line somewhere until next time we need it.
>>
>> The problem is, "next time we need it" is not well-defined.  Think about
>> xargs' -E option, specifying the end of file string.  Can xargs then
>> pass through the rest of stdin to the program it execs?  Not if it
>> already read an arbitrary-sized chunk of it into an internal buffer so
>> further reads from the filehandle start who-knows-where.
> 
> Can you please explain it more in detail? Do you talk about xargs itself
> or the command(s) it will execute?

Ok, suppose I want to extend xargs so that when it hits the end-of-input
indicator (-E BLAH) it can then pass the rest of its input through to
the stdin of the command it's calling.

If I'm using a readline() that actually stops when it hits a newline
character, then I can do this.  But if my readline() is being remotely
efficient and doing a "man 2 read" system call with a buffer size
greater than 1 byte, then it's most likely going to read an arbitrary
number of bytes past that newline character.

We can't feed those bytes _back_ into stdin.  They exist in a memory
buffer within xargs(), but if xargs does an exec() of a child they go
away.  What xargs would have to do to feed them to the child is:

A) Create a "man 2 pipe".
B) Fork the child, have the child close stdin and dup() the receiving
end of the pipe to stdin.
C) Have the parent write the "extra" bytes to the pipe, and then set up
a read/write loop forwarding the rest of the data from its stdin to the
write side of the pipe.

(This is why xargs children reading from stdin get undefined results,
even with the -E option, because doing this is ridiculous.)

The problem is roughly the same when it's just a function: if another
function elsewhere in the program wants to read() from fd 0 after a
readline that wasn't careful about not reading _past_ the end of the
line, it'll miss data.  Those other functions thus have to be aware of
the existence of whatever overflow buffer you've got and how to use it,
_or_ your readline() function has to read one byte at a time so it never
goes past the end of the line character.
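A minimal sketch of that one-byte-at-a-time approach (hypothetical code,
not toybox's actual get_line() implementation):

```c
#include <stdlib.h>
#include <unistd.h>

// Read one line from fd, one byte at a time, so we never consume data
// past the newline.  Returns a malloc()ed string without the terminator,
// or NULL on EOF with nothing read.  Anything left after the newline is
// still in the OS's filehandle, available to later read()s or to a child.
char *get_line(int fd)
{
    char c, *buf = 0;
    size_t len = 0;
    int got = 0;

    for (;;) {
        if (read(fd, &c, 1) != 1) break;    // EOF or error
        got = 1;
        if (c == '\n') break;               // stop AT the terminator
        buf = realloc(buf, len + 2);
        buf[len++] = c;
    }
    if (!got) return NULL;
    buf = realloc(buf, len + 1);
    buf[len] = 0;
    return buf;
}
```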

> If xargs uses buffered input it can use its buffer as long as the process
> exists (and not something stupid like fclose(stdin) was done).
> This reminds me stdio. Please look at this example:

The ANSI guys added FILE * so they'd have the buffer handling in the
library itself.  But if you've got existing functions that use a file
descriptor instead of a FILE * they don't necessarily mix cleanly due to
this issue, and FILE * results in bigger code than fd.  (For one thing,
the ansi fread/fwrite functions are REALLY STUPID, requiring two
arguments which it multiplies together to get the length, for no
apparent reason.)

> [fwb at vdr toybox]$ cat example-setvbuf.c 
> #include <unistd.h>
> #include <stdio.h>
> #define BUFSIZE 8
> #define ERRORSTRING "error on setting buf\n"
> int main()
> {
>   char inbuf[BUFSIZE+1];
>   char outbuf[BUFSIZE+1];
>   int c;
>   if (setvbuf(stdin, inbuf, _IOFBF, BUFSIZE)) goto error;

stdin is a static FILE * instance.

> As you see I got fully buffered stdin and stdout.

Which are FILE * not filehandle.

> I can even use the
> simple getc() and got it buffered.

Which works on a FILE *, not a filehandle.

As I said, libc does implement this for you: badly.

The problem with fscanf() is that although I can limit the length of the
input (in a non-obvious way: "%47s") so it doesn't write past the end of
the buffer I've allocated, until recently there was no way to tell it
to automatically allocate a "large enough" buffer (the gnu guys added an
"a" conversion specifier, posix went with "m").  And the fact this was a
retcon is pretty clear in the fact that %c still requires a length,
otherwise defaulting to one.  There's no way to say "read until end of
line" except to give an arbitrarily big length limiter.

Due to the funky way %s handles whitespace I can't get a verbatim line
out of it, and %c doesn't stop at newlines (it's a block read until it
fills the buffer).  I _might_ be able to abuse %[blah] to do what I
want, but this has its own problems:

Even if %s didn't eat spaces and tabs (or I abuse %[] into acting like
%c but stopping at newlines), it doesn't _return_ the terminator
character.  So an exact match means... what?  There's more data?  We
exactly read a whole line?  If I go back and read more, will I get the
rest of this line or silently concatenate the following line onto this one?
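A small demonstration of that ambiguity (scan_five is a hypothetical
helper; note the caller has to fgetc() the stream itself to find out
what fscanf stopped on):

```c
#include <stdio.h>

// Read at most 5 bytes of a line with fscanf's %[ conversion.  The
// buffer alone can't distinguish "line was exactly 5 bytes" from
// "line was longer", because %[ never hands back the terminator --
// so we peek at the next character ourselves.
int scan_five(FILE *f, char *buf)
{
    buf[0] = 0;
    fscanf(f, "%5[^\n]", buf);   // stops at 5 bytes OR at '\n', silently
    return fgetc(f);             // what did it actually stop on?
}
```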

Basically, it's _easier_ to write my own getline() than try to beat
sense out of the built-in FILE * functions.  Unless something new showed
up in posix 2008 I missed, which is quite possible...

> I can't see the point where this may
> break xargs (or the command it calls?) even with -E and "-n 1".

Did the above explain it?  We read past the end of the line, sucking too
much data out of the OS, and can't feed the extra data _back_ into the
OS's filehandle because it's unidirectional.  So when I do the exec the
child doesn't get all the data unless the parent sets up a pipe to
manually forward it.

>> bytes at a time is 100 times faster than one byte at a time).  But if
>> something later wants a filehandle instead of a file pointer, which
> 
> Of course the bytes in the buffer are lost or you must use a wrapper for
> using stdio and raw io. :-)

Um, yes.  That's my _point_.

>> includes all child programs that want to read from the same filehandle...
> 
> IMHO I'm missing the point.

Yeah, I noticed.

> Can you please explain it more in detail?

Apparently not.

> If a process spawns a child he may also control its standard file descriptors.

I can write extra code to work around the limitation: or I could write
extra code to avoid getting into that situation in the first place.  The
workaround would have to be in every user, so I made get_line() read a
byte at a time instead.

Didn't say I'm happy with it, but this way I only have to fix it once.

>> There are ways around this: the xargs example could instead create a
>> pipe, feed the rest of its buffer into that, and then pass along further
>> data via a select/poll loop a bit like netcat does.  But this requires
>> an extra process stick around to pass along the data instead of merely
>> doing an exec().
> 
> Don't understand this, too. For what reason a pipe?

So the parent can create a synthetic stdin filehandle 0 for the child
which supplies _all_ the data, since the upstream one no longer supplies
all the data because we took too much out and can't put it back into the
original filehandle.

> I guess
> it's better if you post some example code (or example pseudo code). :-)

I'm a dozen messages behind right now, and need to do day job stuff.
Maybe this weekend...

> Frank

Rob

