[Toybox] strings tests and bugfixes

Mon Jul 3 08:28:52 PDT 2017

On 06/14/2017 08:04 AM, Ilya Kuzmich wrote:
> ping?

Sorry again for the delay! (I have sooooo many open tabs. Trying to
close a few.)

>>> I glanced at this one to see if it was trivial to apply but:
>>>
>>> -    if (nread < 1) break;
>>> +    if (nread < 1) {
>>> +      if (count == wlen) xputc('\n');
>>> +      break;
>>> +    }
>>>
>>> We have \n at the end of strings already? Under what circumstances would
>>> it _not_ output them?
>>>
>>>   make strings
>>>   $ diff -u <(./strings strings) <(strings strings) | wc
>>>   	0	0	0
>> current implementation doesn't output newline if last byte does matches:
>> if (((toybuf[i] >= 32) && (toybuf[i] <= 126)) || (toybuf[i] == '\t')) {

Which violates posix:

  http://pubs.opengroup.org/onlinepubs/9699919799/utilities/strings.html

  A printable string is any sequence of four (by default) or more
  printable characters terminated by a <newline> or NUL character.

And which never happens in a file conforming to the ELF spec, which uses
NUL. (The purpose of this command is to show strings "in a binary file",
which originally meant executable format du jour. The current one in
ubuntu seems to default to "-a" which shows all printable text, but it's
still got the plumbing to parse ELF sections if you ask it to because
that was the original purpose. I vaguely recall their ELF traversal had
something stupidly exploitable and they switched to -a because of it?
Not online to look it up right now...)

That said, _we_ violate posix here because I'm not reading ahead and
buffering unlimited data before displaying the output (just the -n
amount, gneerally 4 bytes), which means I'm not enforcing the
nul/newline check because the really simple algorithm I'm doing just
treats end of string as data that's not printable.

Except it's not so much violation as "historical usage" according to the
"Application Usage" section of posix:

  By default the data area (as opposed to the text, "bss", or header
  areas) of a binary executable file is scanned. Implementations
  document which areas are scanned.

  Some historical implementations do not require NUL or <newline>
  terminators for strings to permit those languages that do not use NUL
  as a string terminator to have their strings written.

You have a point that if we start printing the string, we might as well
finish printing it with the newline. But reading ahead and confirming we
have a newline or NULL would cut down on the 4-byte false positives.
(I'm tempted to do it _just_ for the first byte after what we've
buffered, if that's out of spec but not NUL or \n don't print. But
that's of of those "easy to implement but hard to document" things that
make the command _conceptually_ hard to understand. Blah.)

Once again, if somebody ELSE had come to me and pointed out the
deviation I'd probably take it seriously, but if it's been in the code
for 3 years and I'm the first person to notice...

>>> As for the offset+1 bit...
>>>
>>>   $ ./strings -o strings | head -n 3
>>>       567 /lib64/ld-linux-x86-64.so.2
>>>       646 g"X7
>>>      1600 libc.so.6
>>>   $ strings -o strings | head -n 3
>>>      1070 /lib64/ld-linux-x86-64.so.2
>>>      1207 g"X7
>>>      3101 libc.so.6

Posix hasn't got -o. It has "-t d" instead. See the "Rationale" section
for why they changed it. (Basically the octal vs decimal issue.)

I have a todo item about implementing posix -t, but I was waiting for
somebody to need it and you still haven't asked about that, but I might
as well do it while I'm looking at it now...

Sigh, and how much do I care about my other todo item (utf8 support)?
Ubuntu 14.04 doesn't:

  $ strings -a tests/files/utf8/japan.txt
  $

That's probably one of the big reasons I'm the first person to notice
problems with the command. It's verging on obsolete...

>>> I don't see how +1 addresses that? (Or what exactly is going on there.)
>> gnu strings use octal offsets by default

And I used decimal because the strings man page says:

  -o  Like -t o.  Some other versions of strings have -o act like -t d
      instead.  Since we can not be compatible with both ways, we simply
      chose one.

Seemed like something I could choose, so I did.

The vestigial octal in unix (file permissions and such) is left over
from its initial implementation on a pdp-7, which was an 18 bit machine.
(I.E. 3*6, each word was 3 octal bytes.) They switched to 8 bit bytes
when they ported it to the pdp-11 (up and running in March 1971
according to https://www.bell-labs.com/usr/dmr/www/notes.html). Binutils
doing new octal after that was just silly (gcc 1.0 came out in 1987, 16
years after octal output made sense).

That said, busybox mindlessly copied binutils here. Does something
actually use this?

>> toybox strings use decimal, but one-off:

Hmmm... hexdump -c agrees that the offset isn't where the data starts.
That's a bug.

>> $ echo fooo | ./toybox strings -o
>>      -1 fooo

Alright, let's look at the patch again. Yay new test suite entries, the
special case test makes me uncomfortable (why do all the other string
prints _not_ need it... ah, It's duplicating an identical test at the
end of the loop; ouch), and the +1 seems like patching around math
elsewhere being done wrong... Ah, the increment's in the wrong place.

Yeah, I used a goto fixing the duplicate test. Sigh. There's probably a
better way to do that but I'm really tired right now...

Alright, let's look at posix again:

  If the first argument is '-' , the results are unspecified.

Thank you, posix. Question: What if the _third_ argument is '-'?
(Answer: Joerg Schilling says Linux isn't Solaris and thus doesn't
matter. Same answer regardless of question, really.)

The description of -a implies that always doing -a is conformant, but
that still doesn't get us off the hook with not NUL or newline
terminated, so document a deviation...

It's sort of adorable that posix-2013 still thinks "the format of the
output shall be: "%o %s", <byte offset>, <string>" is a good answer
post-y2k when we have files longer than 2 gigabytes. Yes, they specify
that byte offset is a signed int (LP64 says that's 32 bits). Unless
there's some magic context in basedefs, it's been years since I reread
that. I miss having free time...

And then...

  The ISO C standard function isprint() is restricted to a domain of
  unsigned char. This volume of POSIX.1-2008 requires implementations
  to write strings as defined by the current locale.

This is the first mention of isprint() in this page...?

Sigh. I should rewrite this whole thing to use crunch_str() shouldn't I?
Maybe just utf8len. Lemme check in _this_ rewrite before doing that...

Rob