[Toybox] Data compression and hermetic builds.

Rob Landley rob at landley.net
Wed Jul 11 14:24:04 PDT 2018


Something that should probably be documented but I'm not sure quite where. (It's
not exactly roadmap.html, not quite design.html, not quite a FAQ entry...)

Conclusion:

I should implement the compression side of gzip, and all the rest should be
decompressors only.

Reasoning:

In order to drive reproducible builds in a hermetic, auditable build environment,
toybox needs to be able to decompress incoming archives (source code and other
resources) in externally defined formats, and should be able to create archive
files (output bundles and such, and maybe there are streaming uses in rsync or
httpd or something that are worth doing) in a known format.
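
To illustrate the "create archive files in a known format" half, here is a
minimal Python sketch (Python is only for illustration, toybox itself is C) of
emitting a gzip stream whose bytes depend only on the input. The helper name
pack_reproducible is hypothetical. The one real knob is the mtime argument of
gzip.compress (Python 3.8 and later), which zeroes the timestamp field in the
gzip header.

```python
import gzip

def pack_reproducible(data: bytes) -> bytes:
    """Hypothetical helper: gzip-compress data with no embedded timestamp."""
    # mtime=0 writes zeros into the 4-byte MTIME header field, so the
    # output does not change depending on when the build ran.
    return gzip.compress(data, compresslevel=9, mtime=0)

payload = b"hello toybox\n" * 100
first = pack_reproducible(payload)
second = pack_reproducible(payload)
```

Both calls should produce identical bytes on the same machine. A different
deflate implementation or compression level can still yield different (but
equally valid) output, so a hermetic build would also want to pin the
compressor itself.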

This implies that toybox needs multiple decompressors (because we don't know
what format those incoming archives are in, and being unable to cope with input
is inconvenient), but only really needs one compressor. The 80/20 rule applies
here: a fairly crappy compressor still gets you most of the benefit vs no
compression at all. (This is also a variant on being lenient in what you accept,
rigorous in what you emit.)
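
A rough sketch of that asymmetry, in Python for brevity and not how toybox
would actually do it: sniff the leading magic bytes to pick a decompressor for
whatever shows up, but only ever write gzip. The names MAGIC, unpack and pack
are made up for this example, and passing unrecognized input through untouched
is an assumption (a real tool might prefer to fail).

```python
import bz2
import gzip
import lzma

# Leading magic bytes for each accepted input format, paired with the
# standard-library function that decompresses a whole buffer of it.
MAGIC = (
    (b"\x1f\x8b", gzip.decompress),       # gzip
    (b"BZh", bz2.decompress),             # bzip2
    (b"\xfd7zXZ\x00", lzma.decompress),   # xz
)

def unpack(blob: bytes) -> bytes:
    """Lenient side: accept any known format, pass anything else through."""
    for magic, decompress in MAGIC:
        if blob.startswith(magic):
            return decompress(blob)
    return blob  # assumption: unrecognized input is treated as uncompressed

def pack(data: bytes) -> bytes:
    """Rigorous side: always emit gzip, with no timestamp in the header."""
    return gzip.compress(data, mtime=0)

sample = b"some source tarball contents\n" * 50
inputs = [gzip.compress(sample), bz2.compress(sample), lzma.compress(sample)]
outputs = [unpack(blob) for blob in inputs]
```

Each entry in outputs should equal sample, whichever format it arrived in.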

The simplest, most flexible, most ubiquitous, and most standardized compressor is
the "deflate" algorithm. Deflate's published as an IETF RFC (RFC 1951), it's a
small amount of code with small memory usage, and it's used as both an archive
and streaming compressor. Even though kernel.org source tarballs dropped bz2 for
xz, they still provided gz last I checked. Git uses zlib behind the scenes for
both storage and transport, and things like ssh -C are zlib under the covers (and
then the zip file format's been used for everything from java jar files to odf).
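
To make the "same algorithm under several wrappers" point concrete, here is a
small Python sketch using the standard zlib module. The wbits argument selects
the container around the deflate data: negative for a raw stream (RFC 1951), 15
for the zlib wrapper (RFC 1950), and 31 for the gzip wrapper (RFC 1952). The
deflate_with helper is a hypothetical name for this example.

```python
import zlib

def deflate_with(data: bytes, wbits: int) -> bytes:
    """Hypothetical helper: compress data with the container chosen by wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()

data = b"deflate is everywhere\n" * 200

raw = deflate_with(data, -15)      # bare deflate stream, no header or checksum
zlib_fmt = deflate_with(data, 15)  # 2-byte header + deflate + 4-byte Adler-32
gzip_fmt = deflate_with(data, 31)  # 10-byte header + deflate + CRC-32 and length

# Decompression uses the matching wbits value for each container.
restored = [
    zlib.decompress(raw, -15),
    zlib.decompress(zlib_fmt, 15),
    zlib.decompress(gzip_fmt, 31),
]
```

With the same library and settings, stripping the zlib or gzip framing should
leave the same raw deflate bytes, which is why one compressor implementation
can serve all three containers.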

It's old, generic, reliable, and unlikely to be replaced. There were a dozen
compression formats before 'deflate' (zip itself had "unzip, expand, explode"
and it competed with arj/zoo/lharc and so on...), and there have been plenty
afterwards. The "best of breed" changes every few years (the kernel's initramfs
compression currently offers lzma, xz, lzo, and lz4. At least one of those
increases compression via multiple architecture-specific machine code compressors.)

Deflate's a sweet spot that's survived: gzip/zlib/info-zip code from 25 years
ago is still relevant. That sounds like the compressor for toybox. People who
want the compression side for the other formats can build/install the relevant
packages.

Rob


