[Toybox] Impact of global struct size

Mon Jan 8 13:59:58 PST 2024

On Fri, Jan 5, 2024 at 10:45 PM Rob Landley <rob at landley.net> wrote:
>
> On 1/2/24 16:58, Ray Gardner wrote:
> > On Mon, Jan 1, 2024 at 1:39 PM Rob Landley <rob at landley.net> wrote:
> >> ... [ a very long and detailed reply ] ...
> >
> > Rob, thank you for the "GIANT INFODUMP", and I mean that sincerely. It
> > took me a while to read it; it must have taken quite a while to write it.
>
> It did, but you asked. And posting it to the list means I can refer back to it,
> and/or more people can learn it so they don't have to ask me. :)
>
> You know how I say I document compulsively? Combine stream of consciousness
> infodump with Pascal's Apology:
>
> https://www.npr.org/sections/13.7/2014/02/03/270680304/this-could-have-been-shorter
>
> And you get documentation. Editing it DOWN, figuring out a non-duplicative
> sequence where I'm not assuming knowledge I haven't explained yet, and chopping
> it into bite-sized chunks, is the hard part.
>
> Blathering like this is easy. Turning it into a FAQ entry or something is hard.
>
> > A lot of info on kernel-level memory management, I think I got about 90%
> > of it but I'll have to look up some stuff (PLT, GOT, ...).
>
> Procedure Linkage Table and Global Offset Table. The first tracks where
> dynamically linked functions live, the second tracks where dynamically linked
> global variables live.

(s/dynamically linked/position-independent/. but, yes, dynamically
linked stuff is a common case.)

> Ok, take everything here with a grain of salt because I last had to know this in
> detail back around 2010 and I largely avoid dynamic linking when I can because
> it is really messy. I am PROBABLY getting this wrong, but off the top of my head:
>
> [Note: Elliott started another thread while I was traveling with this
> half-finished, and he can correct most of the stuff I get wrong. I'm also
> pointing you at where the kernel code lives, and other references.]
>
> When you exec() a file, Linux checks the executable bit (if it's not executable
> it won't even try, and the suid and sgid bits get handled here too), and then
> does some simple type identification on it, which involves waving it at the
> "binary format loaders" to see if any claim it. (This is a bit like filesystem
> probe functions during mount, only for file data instead of block device data.)
>
> $ ls linux/fs/binfmt*
> linux/fs/binfmt_elf.c        linux/fs/binfmt_flat.c
> linux/fs/binfmt_elf_fdpic.c  linux/fs/binfmt_misc.c
> linux/fs/binfmt_elf_test.c   linux/fs/binfmt_script.c
>
> (Sadly, these can all be kernel modules so you can DYNAMICALLY LOAD a BINARY
> FORMAT LOADER which is just wrong.)
>
> The main one that gets 90% of the use is binfmt_elf, the kernel's ELF executable
> loader. We'll come back to that.
>
> The "binfmt_script" one gets almost all of the rest of the use: it checks if the
> first two bytes of the file are #! and if so it re-runs the exec call with the
> /path/after/that as the new file argument, and inserting everything after the
> first space in that line as argv[1] with the remaining arguments (if any) bumped
> to argv[2] and friends. This is how shell scripts work, and the mechanism perl
> and python and so on inherited. It's also how you can use tinycc to run C as a
> scripting language with the first line being "#!/usr/bin/tinycc -run" which
> turns into "tinycc -run file.c" so it compiles, links, and executes it instead
> of writing it out to a file.
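
(a minimal sketch of the "#!" splitting described above, in userspace C rather than the real kernel code. parse_shebang() is a made-up helper name, and the "everything after the interpreter is ONE argument" behavior is taken from the description here, so treat it as an illustration, not the binfmt_script implementation.)

```c
#include <string.h>

/* Hypothetical helper, not kernel code: split a "#!" line the way the text
 * above describes. Returns -1 if the line is not a usable shebang line,
 * 1 if only an interpreter was found, 2 if there is also an argument. */
int parse_shebang(const char *line, char *interp, size_t ilen,
                  char *arg, size_t alen)
{
    size_t n;

    if (!ilen || !alen) return -1;
    if (line[0] != '#' || line[1] != '!') return -1;
    line += 2;
    line += strspn(line, " \t");          /* skip blanks after "#!" */

    n = strcspn(line, " \t\n");           /* interpreter ends at first blank */
    if (!n || n >= ilen) return -1;
    memcpy(interp, line, n);
    interp[n] = 0;
    line += n;
    line += strspn(line, " \t");

    n = strcspn(line, "\n");              /* rest of the line is ONE argument */
    while (n && (line[n-1] == ' ' || line[n-1] == '\t')) n--;
    if (!n) {
        arg[0] = 0;
        return 1;
    }
    if (n >= alen) return -1;
    memcpy(arg, line, n);
    arg[n] = 0;
    return 2;
}
```

(so "#!/usr/bin/tinycc -run" gives interpreter "/usr/bin/tinycc" and a single argv[1] of "-run", and the script's own path and arguments get appended after that.)
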
>
> And yes, it catches:
>
> $ echo \#\!$(readlink -f bang.sh) > bang.sh && chmod +x bang.sh && ./bang.sh
> bash: ./bang.sh: /home/landley/bang.sh: bad interpreter: Too many levels of
> symbolic links
>
> The elf_fdpic one is the nommu variant of elf, which REALLY SHOULD be a couple
> of if () statements in the same file but they did an ext2/ext3 thing and
> duplicated the file, but unlike deleting both of those and just having ext4
> handle all three variants of the same format in modern systems, the linux-kernel
> guys never went back and cleaned that up because linux-kernel development is
> almost completely ossified and bureaucratically paralyzed these days. Oh well.
>
> You can ignore binfmt_flat as obsolete. It was the nommu fork of binfmt_aout
> which was the old executable format before everybody switched to ELF in 1996.
> There was a binfmt_aout.c which got removed in kernel commit 987f20a9dcce in
> 2022. I wrote more but am making it a FOOTNOTE. (See footnote.)
>
> People mostly stopped writing new ones once binfmt_misc was invented, because
> that sucker's programmable. It's basically a binfmt_script that can be
> programmed (via /proc) to recognize arbitrary file formats and run arbitrary
> commands to handle them:
>
> https://docs.kernel.org/admin-guide/binfmt-misc.html
>
> If you've ever run an arm binary on x86 and it magically called qemu application
> emulation for you, that's because some init script set up a binfmt_misc
> association to do that.
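
(for the curious, a sketch of what such a registration looks like. the field layout ":name:type:offset:magic:mask:interpreter:flags" is from the kernel doc linked above; binfmt_misc_ext_rule() is a made-up helper that only builds the string for an extension ("E") rule. actually writing it to /proc/sys/fs/binfmt_misc/register needs root and is left out.)

```c
#include <stdio.h>

/* Build the line you would write to /proc/sys/fs/binfmt_misc/register for an
 * extension-matching rule. Offset, mask and flags are left empty.
 * Returns the string length, or -1 if it did not fit in the buffer. */
int binfmt_misc_ext_rule(char *out, size_t len, const char *name,
                         const char *ext, const char *interp)
{
    int n = snprintf(out, len, ":%s:E::%s::%s:", name, ext, interp);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

(called with "Python", "pyc" and "/usr/bin/python" this produces ":Python:E::pyc::/usr/bin/python:", which matches one of the examples in that doc. magic-number ("M") rules like the qemu ones use the same layout with byte patterns in the magic and mask fields.)
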
>
> I have no idea what binfmt_elf_test is, it was introduced recently (commit
> 9e1a3ce0a952 in 2022) and from the commit message it's the Linux Test Project
> people crapping unnecessary complexity into the mainline kernel for no obvious
> reason. It's the kernel equivalent of checking in debug printfs. Make an EFFORT
> to ignore that one, it's NOT REAL.
>
> Ok, so back to the ELF loader. We've more or less covered static linking
> earlier, where the loader parses the tables and does a bunch of mmap() and puts
> data in the right place and jumps to the program's _start symbol (actually the
> "entry point address" field in the initial 64 byte ELF header struct at the
> start of the file (52 bytes for 32-bit ELF), but by default the linker will
> stick the address of the _start symbol in there). You can override it with
> "ld -e symbolname" if you really want to, but you're probably skipping various
> setup libc kind of expects if you do that.

s/probably/definitely/ :-)

here's a quick list of almost everything that bionic initializes, with
brief rationale in []s, and the subset that musl also does here rather
than on-demand later/not at all marked with *s:

init HWAsan [first so we can build libc with HWASan too]
*init TLS [early so syscalls and errno work]
*init globals
init arc4random via getrandom()
create pthread_t for main thread [front-loading]
init alternate signal stack [sigaltstack(), debuggability]
init Shadow Call Stack [security]
probe VDSO [front-loading, performance]
init setjmp cookie [security]
*check AT_SECURE
init ELF TLS
init system properties [Android]
init fdsan and fdtrack [debuggability]
init MTE [security/debuggability]
init scudo
install profiling signal handlers [SIG_IGN]
install atfork handler for arc4random implementation
call ifunc resolvers [performance]
GNU RELRO [security]
*call preinit/init ELF ctors (and atexit() fini dtors!)
and finally, call:
                       ... exit(main(argc, argv, envp));

why does bionic do stuff here that it could do later? because the
"user" we're assuming is an app rather than (say) toybox, and apps are
clones of the zygote, any setup we can do once/pages we can dirty in
the zygote is a plus. (and things like the `if (!is_threaded)`
fastpaths some other libcs have are pessimizations for the code users
actually care about.)

the other stuff you see here that you might not see in other libcs is
generally security stuff that's optional/missing on other OSes but
mandatory on Android, or to deal with the fact that most of your
debugging will be of problems for users you can't contact using
devices you don't have, so actionable diagnostics are critical.

> main() actually gets called from _start() and returns to it, and
> _start lives in crt1.o ala:
>
>   readelf -a /usr/lib/*/crt1.o | less
>
> Anyway, tangent.
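
(a small sketch of reading the "entry point address" field mentioned above, using the struct definitions from <elf.h>. elf_entry_point() is a made-up name, it only handles native-endian files, and it works on a buffer you've already read in rather than opening anything itself.)

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Pull e_entry out of an ELF header already sitting in memory.
 * Returns 0 on success, -1 if this doesn't look like a usable ELF header. */
int elf_entry_point(const void *buf, size_t len, uint64_t *entry)
{
    const unsigned char *id = buf;

    /* Every ELF file starts with "\177ELF" inside the e_ident array. */
    if (len < EI_NIDENT || memcmp(id, ELFMAG, SELFMAG)) return -1;

    if (id[EI_CLASS] == ELFCLASS64 && len >= sizeof(Elf64_Ehdr)) {
        Elf64_Ehdr h;

        memcpy(&h, buf, sizeof(h));       /* copy out to avoid unaligned reads */
        *entry = h.e_entry;
        return 0;
    }
    if (id[EI_CLASS] == ELFCLASS32 && len >= sizeof(Elf32_Ehdr)) {
        Elf32_Ehdr h;

        memcpy(&h, buf, sizeof(h));
        *entry = h.e_entry;
        return 0;
    }
    return -1;
}
```

(this is the same number "readelf -h" prints as "Entry point address".)
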
>
> Dynamically linked Elf binaries work a bit like binfmt_script or binfmt_misc
> above, in that instead of executing the binary directly like it does with static
> linking, for dynamic binaries binfmt_elf will find a special entry in the ELF
> headers that points at a DIFFERENT program, and runs that instead:
>
> $ readelf -a /bin/ls | grep Requesting
>       [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
>
> This is called the "dynamic linker/loader" and has a man page: "man 8 ld.so".
> The one in glibc is INSANE. The one in musl-libc is... less insane. Did you know
> that the glibc's "ldd" program that lists the libraries an ELF file is linked
> against is ACTUALLY A SHELL SCRIPT, and all it really does is set the
> LD_TRACE_LOADED_OBJECTS environment variable?

yeah, the environment variable is a bit odd. bionic uses a --list
argument and only musl just uses a symlink. (i'm guessing musl doesn't
believe in systems with both 32-bit and 64-bit userspaces at the same
time, which is something glibc and bionic both use a shell script to
handle --- it works out _which_ linker you want to call. in another
decade, musl will look wise, but it would have been very annoying for
the _last_ decade. and this seems like something bionic and glibc can
just clean up if/when we get to a point where no-one has mixed systems
any more.)

> AND if you set that yourself you
> can't run any dynamic elf binaries which has been used in all SORTS of security
> exploits (the next command won't run, just spam some stuff to stdout and return
> success):
>
> $ LD_TRACE_LOADED_OBJECTS=1 /bin/ls
>         linux-vdso.so.1 (0x00007ffd24b48000)
>         libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fce46d98000)
>         libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce46bd8000)
>         libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fce46b64000)
>         libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce46b5f000)
>         /lib64/ld-linux-x86-64.so.2 (0x00007fce46fe2000)
>         libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce46b3e000)
>
> And yet despite that, /usr/bin/ldd is 193 lines long because GNU.

yeah, bionic's is quite a bit shorter:
https://cs.android.com/android/platform/superproject/+/main:bionic/linker/ldd.sh

> Anyway, GOT and PLT are arrays assembled/populated by the dynamic linker when
> it's loading and resolving the dynamic symbols for a given program.

(well, the _static_ linker "assembles/populates" these. the dynamic
linker [aka "loader"] runs through them, fixing them up at runtime
based on relocations the static linker put in the ELF file.)

> It's the "I
> already loaded that, here's where it lives in memory" table it adds stuff to as it
> grabs symbols.

(no, that's not how that works. how that works is a lot sillier and
less efficient than you'd assume. one of these days i'd really like to
aggressively break backwards compatibility with the 1990s here, and
have Android implement this how people assume it works [where you'd
only look for memset, say, in libc].)

> There is something called "lazy binding" which means it can defer
> loading symbols until they're accessed the same way the MMU can defer faulting
> in physical pages for a mapping, and I totally forget what that looks like in
> the GOT and PLT but you can set LD_BIND_NOW to force it to resolve everything up
> front.

(Android doesn't have lazy binding, and it looks like the desktop
linuxes are starting to think about going that way too.)
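
(a toy model of the lazy binding idea, since "what that looks like" came up: each imported function gets a slot that initially points at a stub, and the first call through the slot resolves the symbol and patches the slot. all the names are made up, and a real PLT/GOT does this with per-architecture assembly stubs calling into the dynamic linker, not with C function pointers.)

```c
/* Toy lazy binding: one slot per imported function, patched on first use. */

static int resolver_runs;                 /* how many times we "resolved" */

static int real_add(int a, int b) { return a + b; }

static int add_stub(int a, int b);

/* The slot every call site goes through. Starts out pointing at the stub. */
static int (*add_slot)(int, int) = add_stub;

static int add_stub(int a, int b)
{
    resolver_runs++;                      /* stand-in for the symbol lookup */
    add_slot = real_add;                  /* patch: later calls skip the stub */
    return add_slot(a, b);
}

/* What a call site does: always jump through the slot. */
int call_add(int a, int b) { return add_slot(a, b); }

/* Rough equivalent of LD_BIND_NOW: fill the slot in before any call. */
void bind_now(void) { add_slot = real_add; }
```

(the first call_add() pays for the lookup, every later one is just an indirect call. calling bind_now() first moves that cost to startup.)
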

> You can also point LD_PRELOAD at a (space or colon separated) list of libraries
> to load before loading any other library, which lets you override any function,
> which is how I used to get vim to STOP DOING GRATUITOUS SYNC CALLS EVERY 100
> CHARACTERS I TYPE ala:
>
> $ cat /home/old/2012/thwim.c
> // Stub out all the "sync" variants, for projects that regularly pause
> // waiting for nonessential data to hit physical media.  (On a loaded system,
> // this can easily be a 30 second wait.)
>
> // cc -fpic -shared thwim.c -o thwim.so
> // LD_PRELOAD=/usr/local/lib/thwim.so vim
>
> int fsync(int fd) { return 0; }
> int fdatasync(int fd) { return 0; }
> void sync(void) { return; }
>
> Oddly enough, I vaguely recall the PLT and GOT being new inventions (well, like
> 20 years ago now). Way back in the dark ages the linker (ld) would turn each
> reference to a given symbol into a linked list (where the pointer for the access
> wouldn't point to the symbol, but would point to the location of the NEXT access
> to it), with the ELF table entry for the unresolved dynamic symbol pointing to
> the first access (head of that particular linked list). Then the dynamic linker
> (ld.so) would create the .text mapping via mmap(MAP_PRIVATE, PROT_WRITE) and go
> through those linked lists at load time (lazy binding wasn't an option yet) to
> replace the address of each as yet unresolved jump instruction with where it had
> dynamically loaded that function or global variable, and then turn the mapping
> read-only when it was done (don't ask me how: neither mremap() nor madvise()
> currently offer an obvious way, possibly mmap(MAP_FIXED) over itself got special
> cased? This is back when the stack was still executable, and also setting up
> these mappings was black magic happening inside the dynamic linker back before I
> ever looked at its source...)
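
(a sketch of the linked-list scheme as described above, modeling the text segment as an array of words. the layout, the CHAIN_END marker and patch_chain() are all assumptions for illustration, not the historical on-disk format.)

```c
#include <stddef.h>
#include <stdint.h>

#define CHAIN_END ((uintptr_t)-1)         /* hypothetical end-of-list marker */

/* Before patching, each reference slot holds the index of the NEXT reference
 * to the same symbol, and the symbol table entry holds the index of the first
 * one (head). Walk the list, overwriting every slot with the resolved
 * address. Returns how many slots were patched. */
size_t patch_chain(uintptr_t *text, size_t ntext, uintptr_t head,
                   uintptr_t addr)
{
    size_t count = 0;

    /* The count bound stops a corrupt (circular) chain from looping forever. */
    while (head != CHAIN_END && head < ntext && count < ntext) {
        uintptr_t next = text[head];      /* read the link before clobbering */

        text[head] = addr;
        head = next;
        count++;
    }
    return count;
}
```

(note that every patched slot is a write into what would otherwise be a shared, read-only page.)
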
>
> The real problem with this approach is every time you modify a writeable
> MAP_PRIVATE page you dirty it, breaking the shared mapping and doing a
> copy-on-write to create your private copy. And since accesses to dynamic symbols
> were scattered all over the code, this dirtied a LOT of pages.
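
(to put a number on "a LOT of pages": a sketch that counts how many distinct pages a list of patch offsets touches. dirty_pages() is a made-up helper and it assumes the offsets are sorted ascending.)

```c
#include <stddef.h>

/* Count distinct pages written to when patching the given byte offsets.
 * Assumes offsets[] is sorted ascending, so equal pages are adjacent. */
size_t dirty_pages(const size_t *offsets, size_t n, size_t page_size)
{
    size_t i, pages = 0, last = (size_t)-1;

    for (i = 0; i < n; i++) {
        size_t page = offsets[i] / page_size;

        if (page != last) {               /* first write to this page */
            pages++;
            last = page;
        }
    }
    return pages;
}
```

(each of those pages becomes a private copy in every process. 32 patched references that each land in a different 4k page cost 128k per process, no matter how few bytes were actually changed.)
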
>
> I learned about this because embedded developers hated it. Even when a Linux
> desktop system only had 16 megs of ram nobody cared THAT much about an extra
> 128k of memory getting consumed by a process, but embedded systems cared about
> saving individual pages. This was one of the original reasons busybox was a
> single binary that could be statically linked, because then if the stars aligned
> you only needed THREE PAGES of memory to spawn a new instance of a command line
> "true". Dynamic linking was just way too expensive for embedded systems to use,
> because the binary may be smaller but the runtime memory usage was way bigger,
> due to breaking sharing on the .text pages. And this affects system performance
> by thrashing the CPU cache, or back at the time the memory bus. I wrote an
> explanation about this for The Motley Fool in a previous life (long story):
>
> https://www.fool.com/archive/portfolios/rulemaker/2000/02/23/inside-intel-again-cold-hard-cache.aspx
>
> That said, doing this _can_ still be an optimization, because with a PLT "struct
> walrus *potato;" it's actually (struct walrus *)got[POTATO_IDX] but you avoid an
> extra dereference by NOT bouncing off the PLT or GOT, and instead patching the
> source address to go directly to the destination. These days with instruction
> reordering and speculative execution down insanely deep pipelines what you'd
> actually be saving is L1 cache lines, and you probably have to benchmark it to
> see which is a win. To be honest I learned about this stuff by asking dumb
> questions on mailing lists over many years:
>
> https://www.uwsg.indiana.edu/hypermail/linux/kernel/0309.1/0716.html

(interestingly, the PLT "won" to such an extent that there's
effectively a register or two reserved for the calculation in each
stub in any modern cpu's set of 32 registers. that's what x16/x17 are
for arm64, for example.)
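
(a toy version of the walrus/potato example, to show where the extra dereference comes from. the got[] array, POTATO_IDX and toy_load() are stand-ins made up for this sketch. a real GOT is laid out by the static linker and filled in by the loader.)

```c
struct walrus { int tusks; };

static struct walrus the_walrus = { 2 };

#define POTATO_IDX 0

static void *got[1];                      /* toy GOT, empty until "load time" */

/* Position-independent access: every use of potato goes via the table. */
#define potato ((struct walrus *)got[POTATO_IDX])

/* Stand-in for the loader fixing up the table entry. */
void toy_load(void) { got[POTATO_IDX] = &the_walrus; }

/* Load the slot, then dereference the result: two memory accesses. */
int tusks_via_got(void) { return potato->tusks; }

/* Address known to the code itself: one memory access, but the code had to
 * be patched (or linked at a fixed address) to get there. */
int tusks_direct(void) { return the_walrus.tusks; }
```

(both return the same value after toy_load(). the difference is one table everybody writes to once versus patches scattered through the code.)
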

> And by this point I expect Elliott has a backlog of facepalms from my attempts
> to explain...
>
> >> yes "ELF format" is like "ATM machine"
> >
> > where I use my PIN code?
> >
> > One bit I can contribute: BSS is an assembler directive dating back at
> > least to the 1960s and probably earlier (don't ask me how I know). It was
> > used to reserve uninitialized space; BSSZ was used to reserve space zeroed
> > out at load time. Don't know if it's in any current assemblers.
> >
> > I tried inserting a printf of sizeof(TT) and find that it does report only
> > the global size of my own toy.
>
> Because TT is #defined as the specific struct out of the union. You #define
> FOR_thingy and then #include "toys.h" and that (eventually) pulls in
> generated/flags.h which does:
>
> #ifdef FOR_acpi
> #define CLEANUP_acpi
> #ifndef TT
> #define TT this.acpi
> #endif
> ...
> #endif
>
> (Which is a bad example because it doesn't have a GLOBALS block so TT is defined
> to something that doesn't exist, but the command never uses it so that doesn't
> cause a problem...)
>
> Anyway, the first time TT is #defined to this.thingy, and "this" is the union at
> the end of generated/globals. In that file, each GLOBALS() block gets turned
> into a struct FILENAME_data { } block, and then the union at the end has an
> instance of that struct added to it, all next to each other. (This is generated
> by scripts/make.sh currently line 247-ish. It's several calls to sed, except
> macintosh sed is trash and instead of replacing it they install "gsed" alongside
> it, so $SED points to the usable sed. Sigh...)
>
> > I should have tried that before I asked
> > about it, and looked at how TT is defined. (I was thinking it was the
> > entire "this" union but obviously it could not be, given how globals are
> > accessed in each toy. Braino...)
>
> It just means measuring that doesn't say how much space the union is actually using.
>
> The problem is generating these headers doesn't depend on the current .config:
> it goes through and seds out ALL the GLOBALS() blocks, and generates the
> corresponding struct definitions with an instance of each in the union, and that
> means the union size is the high water mark of every GLOBALS() block, which is
> currently "ip" out of pending even if it's not enabled.
>
> What I need to do is put USE_FILENAME() macros around each union instance. (The
> name of the file and the name of the first command in the file aren't exactly
> independent. I've tried to unstick them a bit, but different bits of plumbing
> look at different things and if the name of the first command in a file isn't
> the same as the name of the file, more than one thing already gets unhappy.)
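
(a sketch of the size problem and the proposed fix. small_data and big_data are made-up stand-ins for two GLOBALS() blocks, and the USE_ macros follow the usual toybox config style where USE_X(...) expands to its arguments when X is enabled and to nothing otherwise. here "big" is configured out.)

```c
#include <stddef.h>

/* Hypothetical stand-ins for two GLOBALS() blocks, not real commands. */
struct small_data { long count; };
struct big_data { char buf[4096]; };

/* Config macros: "small" is enabled, "big" is disabled. */
#define USE_SMALL(...) __VA_ARGS__
#define USE_BIG(...)

/* What the generated header does today: every struct lands in the union,
 * so its size is the high water mark even for disabled commands. */
union global_union_all {
    struct small_data small;
    struct big_data big;
};

/* With USE_ guards, disabled commands drop out of the union. */
union global_union_guarded {
    USE_SMALL(struct small_data small;)
    USE_BIG(struct big_data big;)
} this;

#define TT this.small                     /* as after #define FOR_small */

size_t tt_size(void) { return sizeof(TT); }
size_t all_size(void) { return sizeof(union global_union_all); }
size_t guarded_size(void) { return sizeof(this); }
```

(sizeof(TT) only ever reports the one struct, which is why printing it looked fine. the unguarded union is at least 4096 bytes here, and the guarded one shrinks back to the size of the only enabled struct.)
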
>
> The cleanup when you do #define FOR_othercommand and #include
> "generated/flags.h" state transitions isn't perfect either: you'll notice above
> the #define TT has an #ifndef around it, it only does it the FIRST time. A file
> can only have one GLOBALS() block (which is why some of them, like the one in
> ps.c, have a union at the start where the different commands' option strings can
> populate different arguments), and the struct created by that GLOBALS block is
> named after the file it's in. Which means the first #define FOR_filename is
> theoretically redundant, but the builtin __FILE__ macro has a path and extension
> on the filename, and the preprocessor isn't smart enough to do "basename -s .c"
> or similar like I can in shell script, and TRYING to get the preprocessor to do
> anything fancy is a bad idea.
>
> So I specify some things redundantly, by hand. Which have to match up with other
> things. First command needs to be named the same as the file it's in. Oh well,
> adding another of that doesn't make it WORSE...
>
> > I know you aren't too big on using "const", but you said (implied?) it
> > could put data into the rodata section. For example, would it be
> > beneficial to do this:
> >
> > static char const * const msg = "a message";
> > static char const * const msgs[] = { "msg1", "msg2", 0 };
>
> The first one, you don't need the pointer. The string constant already resolves
> to a pointer.
>
> The second one, the problem is it's still a named symbol. I think we went over
> that in another thread? (If not ask again, but I'm tired right now...)
>
> > Ray
>
> Footnote: binflat is to fdpic what a.out is to elf, and a.out is toast but flat
> still has a few users, mostly old hardware that hasn't implemented fdpic support
> yet. (Because it uses 3 extra registers, and each new target has to define a new
> ABI specifying which registers are used for which segment, and then tweak the
> compiler to output the right code, and for stuff like "coldfire" nobody seems to
> have bothered.) (Coldfire was a nommu m68k variant with a small number of
> high-volume users, so it wasn't widely used but it shipped a LOT of units.
> Motorola sold its chip division to Freescale which was bought by NXP, thus
> https://en.wikipedia.org/wiki/NXP_ColdFire .)
>
> The a.out format is what Bell Labs Unix used back in the 1970s. (If you didn't
> tell Dennis Ritchie's original C compiler what to call the output file it wrote
> to "a.out" as the default, and linkers still do that today.) And that format
> worked GREAT on a PDP-11 with static binaries, but trying to make dynamic
> linking work with it on 32 bit systems was a mess.
>
> Linux Journal did several articles during the original a.out->ELF transition
> back in 1995:
>
> https://www.linuxjournal.com/article/1139
> https://www.linuxjournal.com/article/1059
> https://www.linuxjournal.com/article/1060
>
> And the Linux Documentation Project did a writeup:
>
> http://web.mit.edu/linux/redhat/redhat-3.0.3/i386/doc/HTML/ldp/ELF-HOWTO-1.html
>
> The libc5 (Linus's libc)->libc6 (Ulrich Drepper's "glibc 2.0" fork,
> http://freesoftwaremagazine.com/articles/history_of_glibc_and_linux_libc/)
> transition happened around the same time, although that was driven by adding
> thread support (because Java can't NOT use threads, because James Gosling
> refused to wrap poll and select so the only way you can do nonblocking I/O in
> Java is spawn a thread to block for you, which had BUCKETS of problems and the
> linux kernel guys spent a good decade trying to whack-a-mole them. Threads were
> invented because Solaris's fork() was very slow, as explained in the famous post
> where Sun engineer Bryan Cantrill convinced the entire Linux community that Sun
> was too dumb to live with one sentence:
> https://landley.net/history/mirror/linux/kissedagirl.html .)
>
> ELF was pretty much ubiquitous by Y2K, and Linux finally got around to
> officially deprecating the old one in 2019:
>
> https://www.linuxjournal.com/content/deprecating-aout-binaries
>
> But they didn't remove binflt because not every nommu system has implemented
> support for fdpic yet, because most embedded developers won't go near the
> wretched hive of scum and villainy that is the linux kernel mailing list, and
> half of them are still using the 2.6 kernel (or even 2.4, and in extreme cases
> 2.2 or 2.0) anyway. I've done a lot of work to try to make it POSSIBLE to use
> current stuff in that space, but I'm up against the fact that an allnoconfig 2.2
> kernel was like 200k and an allnoconfig 6.0 kernel is well over a megabyte, so...
>
> If you want to know about the guts of ELF, the book "linkers and loaders" is
> still definitive (and BONE DRY. I had to read it long ago and it was SO BORING).
> I have a paper copy but the author made it available free online:
>
> https://www.iecc.com/linker/
>
> Although you might want the book in a more modern format:
>
> https://github.com/last-genius/comp_arch_list/blob/master/books/Linkers%20and%20Loaders%20by%20John%20R.%20Levine.pdf
>
> A subset of the horrific standards documents are at:
>
> https://refspecs.linuxfoundation.org/
>
> Which I had to read while trying to debug issues with a new dynamic linker
> implementation for a new architecture. It's also useful if you're maintaining
> your own tinycc fork (https://landley.net/hg/tinycc) or debugging the guts of
> qemu (https://lists.nongnu.org/archive/html/qemu-devel/2010-02/msg00973.html).