[Toybox] [CLEANUP] uuencode.c, pass 1, base64
Rob Landley
rob at landley.net
Thu Apr 11 20:07:01 PDT 2013
Recently I did uuencode cleanup. First let's read through the
unmodified file:
http://landley.net/hg/toybox/file/829/toys/pending/uuencode.c
This shows us the following functions:
static void uuencode_b64_3bytes(char *out, const char *in, int bytes)
static void uuencode_b64_line(char *out, const char *in, int len)
static void uuencode_b64(int fd, const char *name)
static void uuencode_uu_3bytes(char *out, const char *in)
static void uuencode_uu_line(char *out, const char *in, int len)
static void uuencode_uu(int fd, const char *name)
void uuencode_main(void)
The main() function calls either uuencode_uu or uuencode_b64 (depending
on whether or not it got the -m option). The encode function reads
chunks of data and calls the corresponding encode_line() function to
output a line of encoded text in the right format, and the line
function calls the corresponding encode_3bytes() to turn 3 bytes of
8-bit input into 4 characters of appropriately encoded 6-bit output.
The first round of cleanup was commit 830:
http://landley.net/hg/toybox/rev/830
The first hunk tightens up the help text. I have a fairly standard
format for help text: usage line, text description of what it does,
options one per line with a tab between the option and the description.
Someday I hope to write a help text parser that can collate subfeatures
(like "cp" has), and regular help text helps parse it so I can combine
sections.
Next was a uuencode_b64_3bytes() function. This takes up to 3 bytes of
input and outputs 4 bytes of base64. (Given 2 bytes of input, it
outputs 3 bytes and an equals. Given 1 byte, it outputs 2 bytes and two
equals.) This is completely loop unrolled, which used to be an
optimization strategy back before processors started running a dozen
times the speed of their own memory. Since then, tight loops that fit
in a single cache line have trumped quick-to-execute code that spans
multiple L1 cache lines. (Ballpark cache line size is in the 32-128
byte range.
According to /proc/cpuinfo "clflush size" on my netbook is 64 bytes.
That's the granularity with which most memory transactions actually
take place in this processor. If you'd like to learn this stuff go to
http://kernel.org/doc and look for the links to 'ars technica ram
guide'.)
This b64_3bytes() function took an output buffer as one of its
arguments, but the output always goes to stdout, so I just wrote to
stdout in the function here (trusting the FILE * to have an internal
buffer to collate output if it matters, but uuencode isn't hugely
performance critical anyway. Actually I think xputc() might have a
fflush in it, but I mentioned it's not performance critical.)
The function had a loop to read/shift the data into an integer, and
then 4 lines to store each byte into the output buffer, and two tests
to overwrite the last two bytes if the length is short.
I replaced this with one for loop that iterated 4 times (for the 4
output bytes). Each time through, if there's still input data it's
read/shifted into the input integer, and then it writes either a byte
of data or an = depending on how many bytes of input we had and where
we are in the loop. The loop iterates 4 times because we always produce
4 bytes of output, even for short input (which only happens at the end
of a file).
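As a rough illustration of that single-loop shape, here is a minimal
sketch. It is not the code from the commit: the function name is
hypothetical, it writes into a caller-supplied buffer instead of stdout
so it can be checked in isolation, and it uses a constant alphabet
string. It also assumes it is never called with zero bytes of input.

```c
/* Hypothetical helper, not the toybox function: encode 1 to 3 input
   bytes into exactly 4 output characters, padding short input with '='. */
static void b64_encode_3bytes(char *out, const unsigned char *in, int bytes)
{
  static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  unsigned x = 0;
  int i;

  for (i = 0; i < 4; i++) {
    /* While input remains, shift the next byte into a 24-bit value. */
    if (i < bytes) x |= (unsigned)in[i] << (8*(2-i));
    /* Output position i carries data only if input byte i-1 existed. */
    out[i] = (i > bytes) ? '=' : table[(x >> (6*(3-i))) & 0x3f];
  }
}
```

With this sketch, the 3 bytes "Man" encode to "TWFu", the 2 bytes "Ma"
to "TWE=", and the single byte "M" to "TQ==".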
There was a static table[] of value to character mappings, kept in a
constant string. I instead had uuencode_main() generate that in toybuf.
(I lean towards generating things instead of storing them statically so
you can see where they came from. Sometimes I can't, for example
toys/lsb/md5sum has the static md5table but starts with a comment that
says if I was willing to pull in floating point and libm I'd calculate
it via for(i=0; i<64; i++) md5table[i] = abs(sin(i+1))*(1<<32);
Similarly, crc_init generates the crc32 table, for both endiannesses. I
should probably have a comment that the magic constant for
little_endian is the same as the big_endian one just bit-reversed.)
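Generating the base64 alphabet at runtime could look something like the
sketch below. This is an assumption about the general approach, not the
code from the commit: the helper name is hypothetical, the table
parameter stands in for toybuf, and it relies on letters and digits
being contiguous in the character set (true for ASCII).

```c
/* Hypothetical helper: fill table[0..63] with the base64 alphabet.
   Assumes an ASCII-style character set where 'A'..'Z', 'a'..'z' and
   '0'..'9' are each contiguous. */
static void b64_make_table(char *table)
{
  int i;

  for (i = 0; i < 26; i++) {
    table[i] = 'A' + i;        /* values 0-25 */
    table[26 + i] = 'a' + i;   /* values 26-51 */
  }
  for (i = 0; i < 10; i++) table[52 + i] = '0' + i;  /* values 52-61 */
  table[62] = '+';
  table[63] = '/';
}
```

The loops make it visible that the mapping is just three runs of
consecutive characters plus two punctuation marks, which a 64-character
string literal doesn't show at a glance.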
Next function: uuencode_b64_line() is a wrapper function that produces
a line of output, by calling uuencode_b64_3bytes an appropriate number
of times. Just minor cleanups here for now: no need to pass along an
output buffer the b64_3bytes() function no longer uses, and no need to
print the contents when b64_3bytes() does it itself.
So now we get to uuencode_b64(), which is using toybuf. Let's stop and
trace that now-removed output buffer. Only 4 bytes of out[] were ever
filled in by b64_3bytes(), only 4 bytes were printed by b64_line(), but
in uuencode_b64() outbuf was given 64 bytes. (This is why you rework
things until they're right next to each other and you can spot this
sort of thing. By spreading it across 3 functions, the mismatch wasn't
easily noticeable.)
So that can go, but we still have a loop reading lines of data, and
thus we still need a buffer to read a line of data into. (Reading a
line at a time is faster than reading one 3-byte chunk at a time, and
there's per-output-line processing, namely writing a newline, so we
care when lines end anyway. So line at a time is a logical input block
size.)
The old line size was hardwired at 48 bytes of input data, which is 64
characters of output. (You can't tell from where inbuf is declared, but
the read is 48 bytes.) The uuencode spec actually says lines can be
longer than that, specifically:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uuencode.html
> The output stream (encoded bytes) shall be represented in lines of no
> more than 76 characters each.
This is a tiny enough buffer that declaring it on the stack is trivial
(saving toybuf for other uses), and it doesn't persist past this
function call so there's no advantage to it being global. I encoded
what the standard says into my buffer size declaration as char
buf[(76/4)*3]; (The compiler will resolve the constant math at compile
time, and meanwhile it explains where it came from. Enough 4-byte
chunks of output to total 76, and then 3 bytes of input read in each
chunk. That's our read buffer size. I can then sizeof() that in the
actual read, and because it's an array of char I don't have to say
*sizeof(char) because I know that's 1.)
Note that the if (len > 0) dropped out here, because b64_line() has
while (len > 0) so that test is already performed in the function we
call.
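Putting the buffer sizing and the read loop together, a minimal sketch
might look like this. It is not the toybox code: all three function
names are hypothetical, it uses stdio's fread() on a FILE * rather than
reading from a file descriptor, it omits the begin-base64 header and
==== trailer, and it folds the length check into the read loop's
condition.

```c
#include <stdio.h>

/* Hypothetical stand-in chunk encoder: 1 to 3 bytes in, 4 chars out. */
static void b64_chunk(FILE *out, const unsigned char *in, int bytes)
{
  static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  unsigned x = 0;
  int i;

  for (i = 0; i < 4; i++) {
    if (i < bytes) x |= (unsigned)in[i] << (8*(2-i));
    fputc(i > bytes ? '=' : table[(x >> (6*(3-i))) & 0x3f], out);
  }
}

/* One output line: encode len bytes in 3-byte chunks, then a newline. */
static void b64_line(FILE *out, const unsigned char *in, int len)
{
  while (len > 0) {
    b64_chunk(out, in, len < 3 ? len : 3);
    in += 3;
    len -= 3;
  }
  fputc('\n', out);
}

/* Encode the body only, one full-width line per read. */
static void b64_body(FILE *in, FILE *out)
{
  /* 76 output chars per line, 4 chars per chunk, 3 input bytes per chunk */
  unsigned char buf[(76/4)*3];
  size_t len;

  /* sizeof(buf) is already a byte count because buf is a char array. */
  while ((len = fread(buf, 1, sizeof(buf), in)) > 0)
    b64_line(out, buf, (int)len);
}
```

The buffer works out to 57 bytes, so each full read produces exactly 76
characters of output plus the newline, and only the final line can be
shorter.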
So that's base64 encoding, and probably enough for one message. :)
Rob