[Toybox] toybox - added cmp

Rob Landley rob at landley.net
Thu Feb 9 03:46:50 PST 2012


On 02/08/2012 03:13 AM, Tim Elliott wrote:
> On Tue, Feb 7, 2012 at 6:09 PM, Rob Landley <rob at landley.net> wrote:
>> The optargs stuff recently grew "i#<0" where the <0 reads "at least 0",
>> I.E. error out if this argument is less than 0.
> 
> That is nice. Looks like the default can go in there too.
> 
>>>   * I noticed get_line() in lib/lib.c. Should I be using that instead?
>>
>> Probably.  get_line() and get_rawline() read input a line at a time.
>>
>> I note that they do so with one syscall per character, but since I dunno
>> how to push input back into a filehandle...
>>
>> (This was back when I was still resisting FILE * as unnecessary
>> overhead.  I got over it.)
> 
> Can you explain why get_rawline() reads input one char at a time, why
> you would want to push data back into the filehandle, and why you
> resisted FILE *? Hope I'm not distracting too much...

When get_line() reads data, it can't know ahead of time where the
terminator character occurs.  So either it makes sure not to read past
the end of the line (by reading one character at a time), or it has to
store the start of the next line somewhere until the next time we need it.
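A minimal standalone sketch of the byte-at-a-time approach (a
hypothetical read_until(), not the actual lib.c code): one read() per
byte, so the fd position stops exactly at the terminator and nothing
past it is consumed.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Hypothetical sketch: read from fd one byte per read() syscall,
// stopping exactly at the terminator so nothing past it is consumed.
// Returns a malloc()ed NUL-terminated string the caller must free().
char *read_until(int fd, char end)
{
    char *buf = 0, c;
    size_t len = 0;

    for (;;) {
        if (read(fd, &c, 1) != 1) break;           // EOF or error
        if (!(len & 63)) buf = realloc(buf, len + 65);
        buf[len++] = c;
        if (c == end) break;
    }
    if (buf) buf[len] = 0;
    return buf;
}
```

The cost is the one-syscall-per-character overhead discussed above; the
benefit is that whoever reads the fd next (including an exec'd child)
starts exactly where the line ended.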

The problem is, "next time we need it" is not well-defined.  Think about
xargs' -E option, specifying the end of file string.  Can xargs then
pass through the rest of stdin to the program it execs?  Not if it
already read an arbitrary-sized chunk of it into an internal buffer so
further reads from the filehandle start who-knows-where.

FILE * exists to have a standard place to put the buffer.  As long as
every function your program calls afterward reads through that buffer,
it can do "read until you find X in the data" in decent-sized chunks
(even just 100 bytes at a time is 100 times faster than one byte at a
time).  But if something later wants a filehandle instead of a file
pointer, which includes all child programs that want to read from the
same filehandle...

There are ways around this: the xargs example could instead create a
pipe, feed the rest of its buffer into that, and then pass along further
data via a select/poll loop a bit like netcat does.  But this requires
an extra process to stick around to pass along the data instead of
merely doing an exec().
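A rough sketch of that workaround (a hypothetical helper, using a plain
blocking copy loop rather than the select/poll machinery netcat uses):
the parent replays the already-buffered bytes into a pipe, then shuttles
the rest of stdin through it, so the exec'd child sees one seamless
stream -- at the cost of the parent hanging around as a data pump.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

// Run argv[] with its stdin fed first from a leftover buffer, then
// from our own stdin.  Hypothetical sketch, not toybox code.
void exec_with_leftover(char **argv, char *leftover, size_t left_len)
{
    int fds[2];
    char buf[4096];
    ssize_t n;

    pipe(fds);
    if (!fork()) {                  // child: becomes the exec'd program
        close(fds[1]);
        dup2(fds[0], 0);            // its stdin is the read end of the pipe
        close(fds[0]);
        execvp(argv[0], argv);
        _exit(127);
    }

    // Parent sticks around as a data pump: replay the buffered bytes,
    // then copy the real stdin along until EOF.
    close(fds[0]);
    write(fds[1], leftover, left_len);
    while ((n = read(0, buf, sizeof buf)) > 0) write(fds[1], buf, n);
    close(fds[1]);
    wait(0);
}
```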

Another variant is calling lseek() to back the filehandle up, but pipes
aren't seekable so "cat blah | xargs -e EOF thingy" couldn't use that...
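That failure mode is concrete: rewinding a pipe fd fails with ESPIPE,
which is exactly why the lseek() trick only helps with seekable input
(a short hypothetical demo):

```c
#include <errno.h>
#include <unistd.h>

// Try to "push back" one byte on a pipe the way lseek() could on a
// regular file.  Returns the resulting errno: ESPIPE, because pipes
// have no file position to back up.
int try_rewind_pipe(void)
{
    int fds[2], err;

    pipe(fds);
    errno = 0;
    lseek(fds[0], -1, SEEK_CUR);   // works on a file, fails on a pipe
    err = errno;
    close(fds[0]);
    close(fds[1]);
    return err;
}
```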

> Hmm.. looking at get_rawline() in lib.c:
> 
>> for (;;) {
>> 	if (1>read(fd, &c, 1)) break;
>> 	if (!(len & 63)) buf=xrealloc(buf, len+65);
>> 	if ((buf[len++]=c) == end) break;
>> }
> 
> If the above gets a large input that has no newlines, won't it run out
> of memory?

Yup, but that's inherent in what it's trying to do.  (Sort can run out
of memory if you give it a big enough file no matter how many lines it's
broken into; it's got to keep them all in memory at once in order to
sort them before outputting the data.  The last line it reads could
always be the first line in sorted order.)

> You go through a bit of effort to avoid a malloc/free here:
> http://www.landley.net/hg/toybox/rev/7cff5420c90a#l20

Not really, it's actually _less_ effort to use toybuf than to malloc
there. (toybuf already exists, and is a constant length, so using half
of it for each purpose is pretty straightforward.)

> Why is the extra xrealloc/free worthwhile for get_rawline()?

Because the input data is of arbitrary length (don't want to impose a 4k
limit), and it gets returned to the caller to be used for who knows what
(in the case of sort or xargs they'll be calling get_line() more than
once before using the results, so the lines need to be separate entities
instead of reusing the same buffer).

I.E.

xargs -n 2 diff -u << EOF
one
two
three
four
EOF

Results in two calls:

  diff -u one two
  diff -u three four

Which is not the same as this:

  diff -u one two three four

Or this:

  diff -u one
  diff -u two
  diff -u three
  diff -u four

(By the way, realloc() is actually pretty cheap when there's already
free space in the heap after the allocation: all it has to do is resize
the allocation to use that space.  Yes, it can occasionally hit a bump
where it has to malloc() a new chunk, memcpy() the old data into it, and
return the new pointer, but when you're iterating and calling realloc()
a dozen times to get slightly larger each time, the common case is
actually _not_ having to do that.)
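The pattern get_rawline() relies on boils down to this (hypothetical
helper extracted for illustration): grow the buffer 64 bytes at a time,
so realloc() runs once per 64 appends and usually just extends the block
in place.

```c
#include <stdlib.h>

// Append one byte to a growable buffer, enlarging it in 64-byte steps.
// The "len + 65" leaves one spare byte so the caller can NUL-terminate.
// Amortized over 64 appends per realloc(), the growth cost stays low.
char *append_byte(char *buf, size_t *len, char c)
{
    if (!(*len & 63)) buf = realloc(buf, *len + 65);
    buf[(*len)++] = c;
    return buf;
}
```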

> Cheers,
> Tim

Rob
