[Toybox] On hermetic builds.
rob at landley.net
Sun Jan 21 13:30:53 PST 2018
Half the reason for the ongoing gzip work is I want to move the local
deflate implementation into lib/deflate.c, because it's shared by gzip
and zip, and might be of interest to some other stuff. (It's the only
one I plan to do the compression-side for, the rest are just decompressors.)
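
As a rough illustration of why one deflate implementation can serve both gzip and zip, here is a sketch in Python using the standard zlib, gzip and zipfile modules. It is not toybox code, and raw_deflate(), gzip_wrap(), make_zip() and zip_member_payload() are names made up for this example. A gzip member (RFC 1952) is a small header, a raw deflate stream, and a CRC32 plus length trailer; a zip member stored with method 8 carries the same kind of raw deflate stream behind its own header.

```python
import gzip
import io
import struct
import zipfile
import zlib


def raw_deflate(data):
    """Compress to a raw deflate stream (no zlib or gzip wrapper)."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # negative wbits = raw stream
    return comp.compress(data) + comp.flush()


def gzip_wrap(data):
    """Hypothetical helper: build a minimal gzip member around raw deflate."""
    # magic, method 8 (deflate), no flags, mtime 0, no extra flags, OS 255 (unknown)
    header = b"\x1f\x8b\x08\x00" + struct.pack("<I", 0) + b"\x00\xff"
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + raw_deflate(data) + trailer


def make_zip(name, data):
    """Hypothetical helper: build a one-member zip archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def zip_member_payload(archive, name):
    """Hypothetical helper: pull one member's compressed bytes out of a zip."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    # The local file header is 30 fixed bytes, then the name and extra field.
    name_len, extra_len = struct.unpack_from("<HH", archive,
                                             info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return archive[start:start + info.compress_size]


if __name__ == "__main__":
    text = b"hermetic builds " * 64
    # The stock gzip reader accepts the hand-built container.
    assert gzip.decompress(gzip_wrap(text)) == text
    # The zip member's payload inflates with the same raw-deflate decoder.
    payload = zip_member_payload(make_zip("a.txt", text), "a.txt")
    assert zlib.decompress(payload, -15) == text
```

The only per-format work is the framing around the stream, which is the argument for keeping the deflate code in one shared place.
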
The _reason_ for that last parenthetical is toybox's goal of enabling
what google calls "hermetic builds", I.E. package builds that provide
all their own prerequisites, and thus build reliably and portably and
even bit-for-bit reproducibly on as many different systems as possible.
(Portability applying to "the future" as much as the variety of today's
The complete-the-circle case of hermetic builds is the minimal source
bootstrap, I.E. the smallest system that can rebuild itself under itself
from source code. A contemporary solution to this problem was
demonstrated by https://landley.net/aboriginal/about.html, and
https://github.com/landley/mkroot is a simpler more modern version
currently in development. My earlier time working on busybox was aimed
at making aboriginal linux self-hosting, and toybox's roadmap leading to
the 1.0 release is organized around solving this problem using new code
unencumbered by problematic licensing.
Toybox aims to provide a complete hermetic build environment in a
minimal number of packages. This minimum gives boundaries to a
full-system security audit, allows students to read the code of a
complete working system in a finite amount of time, minimizes the amount
to port to new contexts, minimizes the number of programming languages
you need to learn to understand "the system"...
Conceptually this minimum is 4 packages*: kernel (linux), command line
(toybox), libc (musl**), and C compiler*** toolchain (qcc). My old
aboriginal linux project provided a working example of a self-contained
minimal build using seven packages: linux, busybox/make/bash, uClibc,
and binutils/gcc. Then as proof of concept it built Linux From Scratch
6.3 under the result. (Presumably enough to build anything else under,
the programming version of "reducing to an earlier solved problem".) The
saga of me doing that (and accidentally becoming busybox maintainer on
the way) is detailed in http://landley.net/aboriginal/history.html
But getting a real build cycle down to 4 packages (which the "minimal"
and "self-contained" goals strive towards) means toybox has to include
its own implementations of things like zlib and curses if it wants to
provide that functionality.
To download and build source packages, you need to be able to parse
incoming tarballs in all three popular formats (tar.gz, tar.bz2, tar.xz)
but you only really need to be able to _create_ one. Gzip is useful both
as an archiver (it's the 80/20 solution for archiving) and as a
streaming protocol. (Plus the zip file format has been used for all
sorts of things, from java jar files to the archives you reimage an
android phone with. And see my recent post about wanting to loopback
mount 'em, basically a simpler squashfs.)
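
A minimal sketch of the "read three formats, write one" idea, using Python's standard gzip, bz2 and lzma modules as stand-ins for the decompressors (toybox itself is C, and unpack() and pack() are names made up for this example). The format is picked from the magic bytes at the start of the data rather than from the file name:

```python
import bz2
import gzip
import lzma

# Leading magic bytes of each stream, paired with a matching decompressor.
MAGIC = (
    (b"\x1f\x8b", gzip.decompress),      # gzip
    (b"BZh", bz2.decompress),            # bzip2
    (b"\xfd7zXZ\x00", lzma.decompress),  # xz
)


def unpack(blob):
    """Decompress gzip, bzip2 or xz data, chosen by its leading bytes."""
    for magic, decompress in MAGIC:
        if blob.startswith(magic):
            return decompress(blob)
    raise ValueError("unrecognized compression format")


def pack(data):
    """Only one format is ever written: gzip."""
    return gzip.compress(data)


if __name__ == "__main__":
    tarball = b"pretend this is a tar stream " * 32
    for blob in (gzip.compress(tarball), bz2.compress(tarball),
                 lzma.compress(tarball)):
        assert unpack(blob) == tarball
    assert unpack(pack(tarball)) == tarball
```

Three decompressors and one compressor cover the whole download-and-build cycle, which is why only the deflate side needs a compression implementation.
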
So "deflate" is functionality toybox probably needs to provide, which
means toybox should include its own inflate and deflate implementations
as part of its 1.0 release.
tl;dr: That's why I fiddled with gzip.
* Each of these packages has good reasons to be separate. First, there's
multiple implementations you can swap out (llvm for bcc, bsd for linux,
busybox for toybox, bionic for musl). Second, each one deals with its own
problem domain: the kernel is full of drivers for specific hardware and
runs in a different context (ring 0) providing a defined API (system
calls and such) to the rest of the system. The C library provides both
generic functionality every userspace program needs and acts as a glue
layer between the portable-ish c99+posix and a given kernel's system
calls. The C compiler has a lot of processor-specific logic for assembly
language parsing and code generation, and implements the C99 standard.
And the command line utilities are all called from main(argc, argv)
with environment variables, and run in a mostly architecture-independent
context. Projects like bsd and xv6 have (historically) merged these
together into one package, and it was a bad idea.
** This was bionic until they started rewriting it in C++. The other
packages are all written in C, which is a much simpler language
providing minimal abstraction between what the programmer wrote and the
machine language the hardware interprets. Ten years ago tinycc was an
example of a c99 compiler in 100k lines of code that didn't even provide
a nontrivial optimizer, but which booted the linux kernel, ala
*** Bootstrapping this up to a full modern distro**** would involve
writing a modern version of cfront that tinycc/qcc***** could build, and
then building llvm with the result. But this shouldn't be necessary in
the base system. A simpler bootstrap kernel than Linux might be a good
idea too; MIT's xv6 and google fuchsia are playing in that space, as are
a number of bootloaders and embedded systems.
**** Bootstrapping this up to android involves writing a read-only git
engine capable of being driven from repo to download repositories from
git servers, check out working directories, and probably handle
subtrees. I expect "git bisect" would be in scope too. Checking anything
_in_ (merging and commits) would not be in the 1.0 of that. Note the
distinction between 'build environment' and 'development environment',
which I covered in
back in 2008.
***** qcc is https://landley.net/qcc and somebody PLEASE steal that idea
so I don't have to do it after toybox and mkroot and breaking down AOSP
into orthogonal build stages...