[Toybox] gunzip.c cleanup

Rob Landley rob at landley.net
Sun Jan 14 10:29:51 PST 2018


On 01/08/2018 07:27 PM, enh wrote:
> but, yeah, i should have said so in TODO comments. (i thought i did
> comment the former, but that may have been lost in the toybox
> obfuscated c competition rewrite.)

I should explain what all those changes are for:

First hunk: make -12345 switch off the others so it's "last wins"
instead of "lowest wins".

Second hunk: Move zlib.h #include after #include toys.h because
#includes pull in other #includes and some of them care about #defines
happening first so I generally have local includes after the #include
toybox wherever possible to avoid perturbing the existing include sequence.

Next hunk: gzerror_exit() became gzerror_msg() so that "gunzip one.gz
two.gz three.gz" could continue after errors because that's what the
other one does.

I factored out the zlib_inflate() call into its own function so I can
have builtin vs zlib versions as a config option (there's already a
builtin deflate in compress.c that I want this to be able to use, I
should probably ove it into ).

The other reason is the zlib api it's currently using passes through
uncompressed data instead of erroring out. (I.E. on ubuntu "zcat README"
says "gzip: README: not in gzip format" and ours cats it, which means it
will ALSO cat bunzip2 or xz data rather than detecting you've chosen the
wrong compression format. That's a significant behavioral difference
because it's using a wrapper around inflate(). I want to switch it to
use zlib's inflate so it can detect errors, it's covered in
https://www.zlib.net/zlib_how.html but I haven't done that yet.)

The changes to do_gunzip() are also half-finished, I checked in what I
had when I ran out of time to work on it. (Sorry.) But here's a few of
the highlights:

- FILE *out;
+ int out_fd=0;
-    out = xfdopen(out_fd, "w");

The previous code was using a FILE * wrapper around filehandles, I made
it just use filehandles consistently rather than mix the two.

The old code was using FLAG_c in three places and the new stuff is using
FLAG_c in one place. the old code was doing strdup() to create temporary
copies of all the filenames, but only the output filename was ever
getting modified and only in one codepath. I switched from having a
separate "both_files" variable to having out_name be NULL when we didn't
create an output file.

The new code tries to take advantage of the fact loopfiles()
automatically knows that when you have no arguments, you should read
from stdin and the name is "-". (The comment about stderr is wrong, my
bad. As I said, unfinished.)

The old code had a lot of "-" being magic:

-  } else if (!strcmp(arg, "-")) {
-    // "-" means stdin; assume output to stdout.
-    // TODO: require -f to read compressed data from tty?
-    in_name = strdup("-");
-    out_name = strdup("-");
-  } else error_exit("unknown suffix: %s", arg);
-  if (toys.optflags&FLAG_c) {
-    free(out_name);
-    out_name = strdup("-");

But since out_name is only really ever used in xcreate() (which doesn't
know that "-" means stdout), and in_name doesn't get modified, this is
all unnecessary. There were xopen() changes last year so it never
returns stdin/stdout/stderr by default (even if the command is run with
those filehandles closed, see the notstdio() function in lib/xwrap.c
which moves the filehandle out of that range), so we can reliably check
the fd and know 0 means stdin and 1 means stdout. So just do that.

The new code uses -f to read from a tty regardless of whether it's stdin
or not. This is because:

  $ sudo gunzip /dev/ttyS0
  gzip: /dev/ttyS0 is not a directory or a regular file - ignored

Is something you might conceivably want to do, and if we've already
_got_ a -f that means that, why not have it apply?

A lot of changes like:

-  } else error_exit("unknown suffix: %s", arg);
-  both_files = strcmp(in_name, "-") && strcmp(out_name, "-");
-  if (both_files) xstat(in_name, &sb);

becoming

+  if (!out_fd) {
+    // "gunzip x.gz" will decompress "x.gz" to "x".
+    if ((len = strlen(arg))<4 || strcmp(arg+len-3, ".gz")) {
+      error_msg("no .gz: %s", arg);
+      return;
+    }
+    if (!stat(arg, &sb)) {
+      perror_msg("%s", arg);
+      return;
+    }

Are so it can warn and continue instead of error exiting. Yes it's
longer, but it's what the command should do.

The "unknown suffix: %s" becoming "no .gz" is that design.html bit about
not expecting more than a dozen words of english out of Xiaomi's
userbase, although the old one printing the name it barfed on is good
and I should add that back. "no .gz: %s"

Ok, next function: I inlined do_gz() into gzip_main() not just because
it's 6 fewer lines but because if the call to loopfiles() performs the
test it happens once instead of happening every argument.

I could have left the flag modifying like it was but I thought a for
loop from 0 to 9 in the actual level variable, with setting a default
value if it fell off the end, was more straightforward. (If nothing else
it avoids the need for a temporary variable.) The fact I'm then
reversing the result at the end undermines that though. If you really
care I can put it back.

We're not checking FLAG_c here, that's part of the consolidation of
FLAG_c stuff to the single place it's tested.

And there's the inlining of the test into loopfiles().

I need to retest all of this stuff, but with it using the old gzip api
that gets zcat README wrong, I already have tests fail for the wrong
reasons and will have to completely retest after the conversion anyway.
(And then again when I add the deflate() code from compress.c, which
should probably go into lib/ even though bunzip2 didn't because zip.c
also wants to use it and we're not pulling in bzlib.so, so this _is_
kind of special. It's probably the only symmetrical compression and
decompression toybox will have, the other two just decompress and that's
probablyf ine.)

This is also where I looked at the test suite and said "comparing the
gzip output against the gunzip output, we should really have canned
files for it to check against the way bunzip2 does" and that's part of
the same todo item.

So I need to finish it, and didn't have time then. But I didn't want it
to be yet another "I've done 80% of a coding pass, let this fester in my
tree for 6 months and then get reverted" thing like I've got so many of,
so I went ahead and hecked it in because it's in pending. Sorry I forgot
that Android is using stuff out of pending. My bad for falling this far
behind and _having_ stuff android is using still be in pending.

I'm also sorry I didn't have time to do this level of writeup about it
at the time, but my day job has been cutting into my hobby time for
quite a while now. I take my volunteer responsibilities very seriously,
but work still has dibs on my cycles.

And yes, this command needs a follow-up cleanup pass adding comments and
focusing on "is this bit readable" enough. I try to do that at the time
but you always need to do later passes for 1) readability, 2)
security/bugcheck, 3) functionality and standards compliance, 4) test
coverage. And those passes take place _after_ the code is otherwise
"done" (or at least checkpointed). And this hasn't gotten to that point
yet, which is why it's still in pending.

Rob



More information about the Toybox mailing list