[Toybox] Countering trusting trust.

Fri Jul 24 21:48:46 PDT 2020

This keeps coming up and I should have a writeup I can just point people at, so:

15 years ago when I was maintaining Busybox somebody told me the big NORAD
display at Cheyenne Mountain (as recreated in the movie WarGames) ran busybox,
which surprised me: I didn't think my code was good enough to defend the country
from nuclear attack. But they explained they're required to audit every line of
source for anything running on such highly secure systems, and they'd much
rather audit a few hundred thousand lines of busybox code than tens of millions
of lines of corresponding GNU code. This, I understood.

But it doesn't matter how secure your code is if it's running in a system that's
already been compromised. The solution is to get a minimal secure base system,
audit it (have experts read every line of source), and build up from there. At
the root of any package management tree the dependencies go circular (everything
depends on everything else), so there's a base set of packages you have to start
with as a lump or nothing can run. These days, the minimal system to boot to a
shell prompt is 3 packages (kernel, libc, and application: if you're booting
linux to a shell prompt your kernel is linux, your application is toybox, and
your libc is probably either musl or bionic).

Of course auditing the output isn't enough because your development tools could
have been compromised. Creating a new chroot from a machine that's running
spyware is not very useful. So you make a tiny self-hosting system, which can
rebuild itself from source code under itself. This is conceptually FOUR
packages: the kernel, libc, and toybox above, plus a compiler toolchain (which
CAN be a single package if you upgrade Fabrice Bellard's tinycc, as I proposed
doing in my qcc project but have never found time to do).

My first implementation of this concept was aboriginal linux
(https://landley.net/aboriginal/about.html) where I got the self-hosting system
(capable of building Linux From Scratch under the result as proof it could
natively bootstrap up to arbitrary complexity by downloading and compiling
source code) down to 7 packages: the kernel was linux, libc was uclibc, the set
of command line utilities was busybox, the toolchain was 2 packages (just gcc
and binutils, it hadn't yet metastasized into 5 packages, gone gplv3, and
rewritten itself in C++), and then I needed 2 more packages (make and bash)
because the corresponding busybox commands were missing or not yet good enough.

My new one is based on mkroot (https://landley.net/toybox/faq.html#mkroot) with
cross and native compilers from musl-cross-make (via scripts/mcm-buildall.sh in
this source ala https://landley.net/toybox/faq.html#cross). Eventually I'd like
to implement https://landley.net/qcc and get it down to the theoretical 4
packages, but it's a work in progress and nobody ever wants to fund this stuff
(ala https://elinux.org/CELF_Project_Proposal/Combine_tcg_with_tcc) so I can
only throw scraps of hobby time/energy at it.

But then the NEXT step of paranoia is Ken Thompson's "trusting trust" attack,
where the creator of unix modified the early Unix C compiler to recognize and hack
the login program (so the login binary contained an exploit the login.c source
didn't, a hardwired "ken" account with a fixed password), and then he added a
SECOND part so the compiler would recognize and hack itself (inserting the
original exploit for login and the new one for cc) so now the COMPILER binary
would contain an exploit even when it wasn't in the compiler source. Then he
removed the changes from the compiler source, rebuilt it with the modified
binary to make sure the exploit propagated from compiler binary to compiler
binary without being in the source code, and sent it to berkeley so he could
always log into his students' system. Years later, when the ACM gave him the
Turing Award, he told this story in his lecture:
https://dl.acm.org/doi/pdf/10.1145/358198.358210
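
To make the two-stage trick concrete, here is a toy model in Python. It is only
a sketch, not Thompson's actual code: a "binary" is just a string, and
run_compiler(), the source strings, and the BACKDOOR/TROJAN markers are all
names made up for illustration.

```python
# Toy model: a "binary" is just a string, and every name here is hypothetical.
LOGIN_SRC = "login.c: check the password file"
CC_SRC = "cc.c: translate C source into a binary"

BACKDOOR = " [also accept ken's fixed password]"   # payload aimed at login
TROJAN = " [reinsert both hacks]"                   # payload aimed at cc itself


def run_compiler(cc_binary, source):
    """Run the compiler binary on some source and return the output binary."""
    output = "compiled(" + source + ")"
    if TROJAN in cc_binary:                 # only a hacked binary misbehaves
        if source.startswith("login.c"):
            output += BACKDOOR              # exploit that is not in login.c
        elif source.startswith("cc.c"):
            output += TROJAN                # exploit that is not in cc.c
    return output


# An honest compiler binary, and one built once from temporarily hacked source.
clean_cc = "compiled(" + CC_SRC + ")"
hacked_cc = clean_cc + TROJAN

# Rebuild the compiler from CLEAN source using the hacked binary...
rebuilt_cc = run_compiler(hacked_cc, CC_SRC)
# ...then build login from clean source with the result.
login = run_compiler(rebuilt_cc, LOGIN_SRC)

print(TROJAN in rebuilt_cc, BACKDOOR in login)
```

Both checks come out True in this model: the hack survives a rebuild from clean
source, and login picks up a backdoor that appears in neither source file.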

The first defense against this (presented in a PhD thesis
https://dwheeler.com/trusting-trust/) is "countering trusting trust through
diverse double compiling", i.e. compile your compiler's source with a DIFFERENT
compiler, then rebuild it with the resulting output, to wash away any
binary-only hacks that can't propagate through code they don't recognize.
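
A rough sketch of that double compile, again as a toy Python model rather than
anything from the thesis. The Binary record, the EMITS table, and the file names
are hypothetical, and the model assumes builds are deterministic and that the
second compiler doesn't carry the same hack.

```python
from typing import NamedTuple


class Binary(NamedTuple):
    codegen: str   # whose code generator produced these bits
    source: str    # the source file it was compiled from
    trojan: bool   # hidden payload that is not in the source


# Hypothetical: the code generation style each compiler's source implements.
EMITS = {"cc.c": "cc-style", "othercc.c": "other-style"}


def run(compiler, source):
    """Compile source with the given compiler binary."""
    # Output bits depend on what the compiler DOES (its source), not on who
    # built the compiler. A trojan only recognizes its own compiler's source.
    infected = compiler.trojan and source == "cc.c"
    return Binary(EMITS[compiler.source], source, infected)


def diverse_double_compile(suspect, other, cc_source):
    """Return the washed compiler and whether it matches the suspect binary."""
    stage1 = run(other, cc_source)    # cc built by the DIFFERENT compiler
    stage2 = run(stage1, cc_source)   # cc rebuilt by that output
    return stage2, stage2 == suspect


clean_cc = Binary("cc-style", "cc.c", False)
hacked_cc = Binary("cc-style", "cc.c", True)
other_cc = Binary("other-style", "othercc.c", False)

print(diverse_double_compile(clean_cc, other_cc, "cc.c")[1])
print(diverse_double_compile(hacked_cc, other_cc, "cc.c")[1])
```

In this model an honest cc matches the stage 2 result bit for bit, while the
hacked one does not, even though the hacked binary happily reproduces itself
when it rebuilds its own clean source.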

But the only definitive defense is to audit the binaries of your minimal native
development environment, not just their source. Due to the prevalence of viruses
on windows, an entire industry of binary auditors has grown up reverse
engineering the exploit du jour, with companies like Veracode that employ them.
Most of the good ones seem to be women, presumably because they've been guarding
themselves from asshole men spiking their drinks for their entire careers, and
seem to wind up specializing in security out of self defense. On twitter I
followed @0xabad1dea @aloria @hacks4pancakes @malwareunicorn @fox0x01 and so on...

Presumably packages you add while "building up" only require source auditing,
when they can be built from audited source using your reproducible environment
of known good binaries.

This is why the minimal native development environment needs to be small,
simple, and understandable. It needs to be maintainable, but it also needs to be
auditable. It should require no external dependencies because they add to the
pile of things that need not just _source_ auditing but _binary_ auditing. (And
it can't be a one-time thing, you need to periodically re-audit it to make sure
nobody's pulled something funny.) Having the entire minimal base system written
in the same language helps simplify the auditing process, and since linux and
most libc implementations are already written in C that was the logical language
to write toybox in (before working that out I was considering lua). Tinycc is
written in C and qcc (tinycc+qemu's tcg) should be written in C. There's a plan
to use llvm-cbe as an improved cfront, but how much a source audit of llvm-cbe's
output of clang differs from a binary audit is an open question. And C was
initially designed as a "portable assembly language", with reverse compilation
tools and a flourishing community of reverse engineers that recreate lost source
code for fun (https://www.youtube.com/watch?v=5tADL_fmsHQ).

One other note: if you can't reproduce it, what you're doing is not science. If
you can't recreate an experiment from first principles under laboratory
conditions, it's just alchemy. The ability to regularly reproduce the minimal
native development environment and bootstrap your way up to arbitrary complexity
in an automated fashion is an important regression test.
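
One way to turn that into an automated check is to build the same sources twice
and compare every output file by hash. A minimal Python sketch, assuming each
build lands in its own directory tree; tree_digests() and diff_builds() are
hypothetical helpers, not part of mkroot or toybox.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_builds(first, second):
    """List files that are missing from one tree or differ between the two."""
    a, b = tree_digests(first), tree_digests(second)
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))
```

An empty list from diff_builds() means the two trees are byte for byte
identical. This only compares file contents, so permissions, ownership, and
timestamps would need separate handling.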

Oh, and the native builds I'm doing are architecture-agnostic: the build of the
native system targets x86 or arm or superh or powerpc, and then the package
builds within that system do the normal configure/make/install dance as native
builds, by default not caring what architecture they're on. (That's the
"portable" part of C being a "portable assembly language". Most scripting
languages care even less, except for the #ifdef staircase in jit code generators...)

Rob

P.S. Doing the same for hardware is a whole second set of fun I'm working on
over in the https://j-core.org side of things. You need open designs and open
tools for netlist generation and place and route and such, and you have to fab
on a fully open process with no black box libraries for pads or srams, you need
to do the low-level layout yourself which most fabs won't give you the spec
sheets for, and then you have to decap the chips when they come back to you and
compare them with what you sent out, and even THEN there are some interesting
papers on compromising chips purely through selective doping.