[Toybox] gunzip.c cleanup

Wed Jan 17 15:30:28 PST 2018

On Sun, Jan 14, 2018 at 10:29 AM, Rob Landley <rob at landley.net> wrote:
> On 01/08/2018 07:27 PM, enh wrote:
>> but, yeah, i should have said so in TODO comments. (i thought i did
>> comment the former, but that may have been lost in the toybox
>> obfuscated c competition rewrite.)
>
> I should explain what all those changes are for:
>
> First hunk: make -12345 switch off the others so it's "last wins"
> instead of "lowest wins".
>
> Second hunk: Move zlib.h #include after #include toys.h because
> #includes pull in other #includes and some of them care about #defines
> happening first so I generally have local includes after the #include
> toybox wherever possible to avoid perturbing the existing include sequence.
>
> Next hunk: gzerror_exit() became gzerror_msg() so that "gunzip one.gz
> two.gz three.gz" could continue after errors because that's what the
> other one does.
>
> I factored out the zlib_inflate() call into its own function so I can
> have builtin vs zlib versions as a config option (there's already a
> builtin deflate in compress.c that I want this to be able to use, I
> should probably ove it into ).
>
> The other reason is the zlib api it's currently using passes through
> uncompressed data instead of erroring out. (I.E. on ubuntu "zcat README"
> says "gzip: README: not in gzip format" and ours cats it, which means it
> will ALSO cat bunzip2 or xz data rather than detecting you've chosen the
> wrong compression format. That's a significant behavioral difference
> because it's using a wrapper around inflate(). I want to switch it to
> use zlib's inflate so it can detect errors, it's covered in
> https://www.zlib.net/zlib_how.html but I haven't done that yet.)
>
> The changes to do_gunzip() are also half-finished, I checked in what I
> had when I ran out of time to work on it. (Sorry.) But here's a few of
> the highlights:
>
> - FILE *out;
> + int out_fd=0;
> -    out = xfdopen(out_fd, "w");
>
> The previous code was using a FILE * wrapper around filehandles, I made
> it just use filehandles consistently rather than mix the two.
>
> The old code was using FLAG_c in three places and the new stuff is using
> FLAG_c in one place. the old code was doing strdup() to create temporary
> copies of all the filenames, but only the output filename was ever
> getting modified and only in one codepath. I switched from having a
> separate "both_files" variable to having out_name be NULL when we didn't
> create an output file.
>
> The new code tries to take advantage of the fact loopfiles()
> automatically knows that when you have no arguments, you should read
> from stdin and the name is "-". (The comment about stderr is wrong, my
> bad. As I said, unfinished.)
>
> The old code had a lot of "-" being magic:
>
> -  } else if (!strcmp(arg, "-")) {
> -    // "-" means stdin; assume output to stdout.
> -    // TODO: require -f to read compressed data from tty?
> -    in_name = strdup("-");
> -    out_name = strdup("-");
> -  } else error_exit("unknown suffix: %s", arg);
> -  if (toys.optflags&FLAG_c) {
> -    free(out_name);
> -    out_name = strdup("-");
>
> But since out_name is only really ever used in xcreate() (which doesn't
> know that "-" means stdout), and in_name doesn't get modified, this is
> all unnecessary. There were xopen() changes last year so it never
> returns stdin/stdout/stderr by default (even if the command is run with
> those filehandles closed, see the notstdio() function in lib/xwrap.c
> which moves the filehandle out of that range), so we can reliably check
> the fd and know 0 means stdin and 1 means stdout. So just do that.
>
> The new code uses -f to read from a tty regardless of whether it's stdin
> or not. This is because:
>
>   $ sudo gunzip /dev/ttyS0
>   gzip: /dev/ttyS0 is not a directory or a regular file - ignored
>
> Is something you might conceivably want to do, and if we've already
> _got_ a -f that means that, why not have it apply?
>
> A lot of changes like:
>
> -  } else error_exit("unknown suffix: %s", arg);
> -  both_files = strcmp(in_name, "-") && strcmp(out_name, "-");
> -  if (both_files) xstat(in_name, &sb);
>
> becoming
>
> +  if (!out_fd) {
> +    // "gunzip x.gz" will decompress "x.gz" to "x".
> +    if ((len = strlen(arg))<4 || strcmp(arg+len-3, ".gz")) {
> +      error_msg("no .gz: %s", arg);
> +      return;
> +    }
> +    if (!stat(arg, &sb)) {
> +      perror_msg("%s", arg);
> +      return;
> +    }
>
> Are so it can warn and continue instead of error exiting. Yes it's
> longer, but it's what the command should do.
>
> The "unknown suffix: %s" becoming "no .gz" is that design.html bit about
> not expecting more than a dozen words of english out of Xiaomi's
> userbase, although the old one printing the name it barfed on is good
> and I should add that back. "no .gz: %s"
>
> Ok, next function: I inlined do_gz() into gzip_main() not just because
> it's 6 fewer lines but because if the call to loopfiles() performs the
> test it happens once instead of happening every argument.
>
> I could have left the flag modifying like it was but I thought a for
> loop from 0 to 9 in the actual level variable, with setting a default
> value if it fell off the end, was more straightforward. (If nothing else
> it avoids the need for a temporary variable.) The fact I'm then
> reversing the result at the end undermines that though. If you really
> care I can put it back.
>
> We're not checking FLAG_c here, that's part of the consolidation of
> FLAG_c stuff to the single place it's tested.
>
> And there's the inlining of the test into loopfiles().
>
> I need to retest all of this stuff, but with it using the old gzip api
> that gets zcat README wrong, I already have tests fail for the wrong
> reasons and will have to completely retest after the conversion anyway.
> (And then again when I add the deflate() code from compress.c, which
> should probably go into lib/ even though bunzip2 didn't because zip.c
> also wants to use it and we're not pulling in bzlib.so, so this _is_
> kind of special. It's probably the only symmetrical compression and
> decompression toybox will have, the other two just decompress and that's
> probablyf ine.)
>
> This is also where I looked at the test suite and said "comparing the
> gzip output against the gunzip output, we should really have canned
> files for it to check against the way bunzip2 does" and that's part of
> the same todo item.
>
> So I need to finish it, and didn't have time then. But I didn't want it
> to be yet another "I've done 80% of a coding pass, let this fester in my
> tree for 6 months and then get reverted" thing like I've got so many of,
> so I went ahead and hecked it in because it's in pending. Sorry I forgot
> that Android is using stuff out of pending. My bad for falling this far
> behind and _having_ stuff android is using still be in pending.

i think the fix is to have more, smaller commits rather than trying to
do everything in one commit.

and to always run the tests before committing...

> I'm also sorry I didn't have time to do this level of writeup about it
> at the time, but my day job has been cutting into my hobby time for
> quite a while now. I take my volunteer responsibilities very seriously,
> but work still has dibs on my cycles.
>
> And yes, this command needs a follow-up cleanup pass adding comments and
> focusing on "is this bit readable" enough. I try to do that at the time
> but you always need to do later passes for 1) readability, 2)
> security/bugcheck, 3) functionality and standards compliance, 4) test
> coverage. And those passes take place _after_ the code is otherwise
> "done" (or at least checkpointed). And this hasn't gotten to that point
> yet, which is why it's still in pending.
>
> Rob