[Toybox] Impact of global struct size

Fri Jan 5 22:51:29 PST 2024

On 1/2/24 16:58, Ray Gardner wrote:
> On Mon, Jan 1, 2024 at 1:39 PM Rob Landley <rob at landley.net> wrote:
>> ... [ a very long and detailed reply ] ...
> 
> Rob, thank you for the "GIANT INFODUMP", and I mean that sincerely. It
> took me a while to read it; it must have taken quite a while to write it.

It did, but you asked. And posting it to the list means I can refer back to it,
and/or more people can learn it so they don't have to ask me. :)

You know how I say I document compulsively? Combine stream of consciousness
infodump with Pascal's Apology:

https://www.npr.org/sections/13.7/2014/02/03/270680304/this-could-have-been-shorter

And you get documentation. Editing it DOWN, figuring out a non-dupliciative
sequence where I'm not assuming knowledge I haven't explained yet, and chopping
it into bite-sized chunks, is the hard part.

Blathering like this is easy. Turning into a FAQ entry or something is hard.

> A lot of info on kernel-level memory management, I think I got about 90%
> of it but I'll have to look up some stuff (PLT, GOT, ...).

Procedure Linkage Table and Global Offset Table. The first tracks where
dynamically linked functions live, the second tracks dynamically linked global
variables live.

Ok, take everything here with a grain of salt because I last had to know this in
detail back around 2010 and I largely avoid dynamic linking when I can because
is really messy. I am PROBABLY getting this wrong, but off the top of my head:

[Note: Elliott started another thread while I was traveling with this
half-finished, and he can correct most of the stuff I get wrong. I'm also
pointing you at where the kernel code lives, and other references.]

When you exec() a file, Linux checks the executable bit (if it's not executable
it won't even try, and the suid and sgid bits get handled here too), and then
does some simple type identification on it, which involves waving it at the
"binary format loaders" to see if any claim it. (This is a bit like filesystem
probe functions during mount, only for file data instead of block device data.)

$ ls linux/fs/binfmt*
linux/fs/binfmt_elf.c        linux/fs/binfmt_flat.c
linux/fs/binfmt_elf_fdpic.c  linux/fs/binfmt_misc.c
linux/fs/binfmt_elf_test.c   linux/fs/binfmt_script.c

(Sadly, these can all be kernel modules so you can DYNAMICALLY LOAD a BINARY
FORMAT LOADER which is just wrong.)

The main one that gets 90% of the use is binfmt_elf, the kernel's ELF executable
loader. We'll come back to that.

The "binfmt_script" one gets almost all of the rest of the use: it checks if the
first two bytes of the file are #! and if so it re-runs the exec call with the
/path/after/that as the new file argument, and inserting everything after the
first space in that line as argv[1] with the remaining arguments (if any) bumped
to argv[2] and friends. This is how shell scripts work, and the mechanism perl
and python and so on inherited. It's also how you can use tinycc to run C as a
scripting language with the first line being "#!/usr/bin/tinycc -run" which
turns into "tinycc -run file.c" so it compiles, links, and executes it instead
of writing it out to a file.

And yes, it catches:

$ echo \#\!$(readlink -f bang.sh) > bang.sh && chmod +x bang.sh && ./bang.sh
bash: ./bang.sh: /home/landley/bang.sh: bad interpreter: Too many levels of
symbolic links

The elf_fdpic one is the nommu variant of elf, which REALLY SHOULD be a couple
of if () statements in the same file but they did an ext2/ext3 thing and
duplicated the file, but unlike deleting both of those and just having ext4
handle all three variants of the same format in modern systems, the linux-kernel
guys never went back and cleaned that up because linux-kernel developmet is
almost completely ossified and bureuacratically paralyzed these days. Oh well.

You can ignore binfmt_flat as obsolete. It was the nommu fork of binfmt_aout
which was the old executable format before everybody switched to ELF in 1996.
There was a binfmt_aout.c which got removed in kernel commit 987f20a9dcce in
2022. I wrote more but am making it a FOOTNOTE. (See footnote.)

People mostly stopped writing new ones once binfmt_misc was invented, because
that sucker's programmable. It's basically a binfmt_script that can be
programmed (via /proc) to recognize arbitrary file formats and run arbitrary
commands to handle them:

https://docs.kernel.org/admin-guide/binfmt-misc.html

If you've ever run an arm binary on x86 and it magically called qemu application
emulation for you, that's because some init script setup a binfmt_misc
association to do that.

I have no idea what binfmt_elf_test is, it was introduced recently (commit
9e1a3ce0a952 in 2022) and from the commit message the Linux Test Project people
crapping unnecessary complexity into the mainline kernel for no obvious reason.
It's the kernel equivalent of checking in debug printfs. Make an EFFORT to
ignore that one, it's NOT REAL.

Ok, so back to the ELF loader. We've more or less covered static linking
earlier, where the loader parses the tables and does a bunch of mmap() and puts
data in the right place and jumps to the program's _start symbol (actually the
"entry point address" field in the initial 127 byte ELF header struct at the
start of the file, but by default the linker will stick the address of the
_start symbol in there. You can override it with "ld -e symbolname" if you
really want to, but you're probably skipping various setup libc kind of expects
if you do that. main() actually gets called from _start() and returns to it, and
_start lives in crt1.o ala:

  readelf -a /usr/lib/*/crt1.o | less

Anyway, tangent.

Dynamically linked Elf binaries work a bit like binfmt_script or binfmt_misc
above, in that instead of executing the binary directly like it does with static
linking, for dynamic binaries binfmt_elf will find a special entry in the ELF
headers that points at a DIFFERENT program, and runs that instead:

$ readelf -a /bin/ls | grep Requesting
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

This is called the "dynamic linker/loader" and has a man page: "man 8 ld.so".
The one in glibc is INSANE. The one in musl-libc is... less insane. Did you know
that the glibc's "ldd" program that lists the libraries an ELF file is linked
against is ACTUALLY A SHELL SCRIPT, and all it really does is set the
LD_TRACE_LOADED_OBJECTS environment variable? AND if you set that yourself you
can't run any dynamic elf binaries which has been used in all SORTS of security
exploits (the next command won't run, just spam some stuff to stdout and return
success):

$ LD_TRACE_LOADED_OBJECTS=1 /bin/ls
	linux-vdso.so.1 (0x00007ffd24b48000)
	libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fce46d98000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce46bd8000)
	libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fce46b64000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce46b5f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fce46fe2000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce46b3e000)

And yet despite that, /usr/bin/ldd is 193 lines long because GNU.

Anyway, GOT and PLT are arrays assembled/populated by the dynamic linker when
it's loading and resolving the dynamic symbols for a given program. It's the "I
already loaded that, here's where it lives memory" table it adds stuff to as it
grabs symbols. There is something called "lazy binding" which means it can defer
loading symbols until they're accessed the same way the MMU can defer faulting
in physical pages for a mapping, and I totally forget what that looks like in
the GOT and PLT but you can set LD_BIND_NOW to force it to resolve everything up
front.

You can also point LD_PRELOAD at a (space or colon separated) list of libraries
to load before loading any other library, which lets you override any function,
which is how I used to get vim to STOP DOING GRATUITOUS SYNC CALLS EVERY 100
CHARACTERS I TYPE ala:

$ cat /home/old/2012/thwim.c
// Stub out all the "sync" variants, for projects that regularly pause
// waiting for nonessential data to hit physical media.  (On a loaded system,
// this can easily be a 30 second wait.)

// cc -fpic -shared thwim.c -o thwim.so
// LD_PRELOAD=/usr/local/lib/thwim.so vim

int fsync(int fd) { return 0; }
int fdatasync(int fd) { return 0; }
void sync(void) { return; }

Oddly enough, I vaguely recall the PLT and GOT being new inventions (well, like
20 years ago now). Way back in the dark ages the linker (ld) would turn each
reference to a given symbol into a linked list (where the pointer for the access
wouldn't point to the symbol, but would point to the location of the NEXT access
to it), with the ELF table entry for the unresolved dynamic symbol pointing to
the first access (head of that particular linked list). Then the dynamic linker
(ld.so) would create the .text mapping via mmap(MAP_PRIVATE, PROT_WRITE) and go
through those linked lists at load time (lazy binding wasn't an option yet) to
replace the address of each as yet unresolved jump instruction with where it had
dynamically loaded that function or global variable, and then turn the mapping
read-only when it was done (don't ask me how: neither mremap() nor madvise()
currently offer an obvious way, possibly mmap(MAP_FIXED) over itself got special
cased? This is back when the stack was still executable, and also setting up
these mappings was black magic happening inside the dynamic linker back before I
ever looked at its source...)

The real problem with this approach is every time you modify a writeable
MAP_PRIVATE page you dirty it, breaking the shared mapping and doing a
copy-on-write to create your private copy. And since accesses to dynamic symbols
were scattered all over the code, this dirtied a LOT of pages.

I learned about this because embedded developers hated it. Even when a Linux
desktop system only had 16 megs of ram nobody cared THAT much about an extra
128k of memory getting consumed by a process, but embedded systems cared about
saving individual pages. This was one of the original reasons busybox was a
single binary that could be statically linked, because then if the stars aligned
you only needed THREE PAGES of memory to spawn a new instance of a command line
"true". Dynamic linking was just way too expensive for embedded systems to use,
because the binary may be smaller but the runtime memory usage was way bigger,
due to breaking sharing on the .text pages. And this affects system performance
by thrashing the CPU cache, or back at the time the memory bus. I wrote an
explanation about this for The Motley Fool in a previous life (long story):

https://www.fool.com/archive/portfolios/rulemaker/2000/02/23/inside-intel-again-cold-hard-cache.aspx

That said, doing this _can_ still be an optimization, because with a PLT "struct
walrus *potato;" it's actually (struct walrus *)got[POTATO_IDX] but you avoid an
extra dereference by NOT bouncing off the PLT or GOT, and instead patching the
source address to go directly to the destination. These days with instruction
reordering and speculative execution down insanely deep pipelines what you'd
actually be saving is L1 cache lines, and you probably have to benchmark it to
see which is a win. To be honest I learned about this stuff by asking dumb
questions on mailing lists over many years:

https://www.uwsg.indiana.edu/hypermail/linux/kernel/0309.1/0716.html

And by this point I expect Elliott has a backlog of facepalms from my attempts
to explain...

>> yes "ELF format" is like "ATM machine"
> 
> where I use my PIN code?
> 
> One bit I can contribute: BSS is an assembler directive dating back at
> least to the 1960s and probably earlier (don't ask me how I know). It was
> used to reserve uninitialized space; BSSZ was used to reserve space zeroed
> out at load time. Don't know if it's in any current assemblers.
> 
> I tried inserting a printf of sizeof(TT) and find that it does report only
> the global size of my own toy.

Because TT is #defined as the specific struct out of the union. You #define
FOR_thingy and then #include "toys.h" and that (eventually) pulls in
generated/flags.h which does:

#ifdef FOR_acpi
#define CLEANUP_acpi
#ifndef TT
#define TT this.acpi
#endif
...
#endif

(Which is a bad example because it doesn't have a GLOBALS block so TT is defined
to something that doesn't exist, but the command never uses it so that doesn't
cause a problem...)

Anyway, the first time TT is #defined to this.thingy, and "this" is the union at
the end of generated/globals. In that file, each GLOBALS() block gets turned
into a struct FILENAME_data { } block, and then the union at the end has an
instance of that struct added to it, all next to each other. (This is generated
by scripts/make.sh currently line 247-ish. Its severeal calls to sed, except
macintosh sed is trash and instead of replacing it they install "gsed" alongside
it, so $SED points to the usable sed. Sigh...

> I should have tried that before I asked
> about it, and looked at how TT is defined. (I was thinking it was the
> entire "this" union but obviously it could not be, given how globals are
> accessed in each toy. Braino...)

It just means measuring that doesn't say how much space the union is actually using.

The problem is generating these headers doesn't depend on the current .config:
it goes through and seds out ALL the GLOBALS() blocks, and generates the
corresponding struct definitions with an instance of each in the union, and that
means the union size is the high water mark of every GLOBALS() block, which is
currently "ip" out of pending even if it's not enabled.

What I need to do is put USE_FILENAME() macros around each union instance. (The
name of the file and the name of the first command in the file aren't exactly
independent. I've tried to unstick them a bit, but different bits of plumbing
look at different things and if the name of the first command in a file isn't
the same as the name of the file, more than one thing already gets unhappy.)

The cleanup when you do #define FOR_othercommand and #include
"generated/flags.h" state transitions isn't perfect either: you'll notice above
the #define TT has an #ifndef around it, it only does it the FIRST time. A file
can only have one GLOBALS() block (which is why some of them, like the one in
ps.c, has a union at the start where the different commands option strings can
populate different arguments), and the struct created by that GLOBALS block is
named after the file it's in. Which means the first #define FOR_filename is
theoretically redundant, but the builtin __FILE__ macro has a path and extension
on the filename, and the preprocessor isn't smart enough to do "basename -s .c"
or similar like I can in shell script, and TRYING to get the preprocessor to do
anything fancy is a bad idea.

So I specify some things redundantly, by hand. Which have to match up with other
things. First command needs to be named the same as the file it's in. Oh well,
adding another of that doesn't make it WORSE...

> I know you aren't too big on using "const", but you said (implied?) it
> could put data into the rodata section. For example, would it be
> beneficial to do this:
> 
> static char const * const msg = "a message";
> static char const * const msgs[] = { "msg1", "msg2", 0 };

The first one, you don't need the pointer. The string constant already resolves
to a pointer.

The second one, the problem is it's still a named symbol. I think we went over
that in another thread? (If not ask again, but I'm tried right now...)

> Ray

Footnote: binflat is to fdpic what a.out is to elf, and a.out is toast but flat
still has a few users, mostly old hardware that hasn't implemented fdpic support
yet. (Because it uses 3 extra registers, and each new target has to define a new
ABI specifying which registers are used for which segment, and then tweak the
compiler to output the right code, and for stuff like "coldfire" nobody seems to
have bothered. (Coldfire was a nommu m68k variant with a small number of
high-volume users, so it wasn't widely used but it shipped a LOT of units.
Motorola sold its chip division to Freescale which was bought by NXP, thus
https://en.wikipedia.org/wiki/NXP_ColdFire).

The a.out format is what Bell Labs Unix used back in the 1970s. (If you didn't
tell Dennis Ritchie's original C compiler what to call the output file it wrote
to "a.out" as the default, and linkers still do that today.) And that format
worked GREAT on a PDP-11 with static binaries, but trying to make dynamic
linking work with it on 32 bit systems was a mess.

Linux journal did several articles during the original a.out->ELF transition
back in 1995:

https://www.linuxjournal.com/article/1139
https://www.linuxjournal.com/article/1059
https://www.linuxjournal.com/article/1060

And the Linux Documentation project did a writeup:

http://web.mit.edu/linux/redhat/redhat-3.0.3/i386/doc/HTML/ldp/ELF-HOWTO-1.html

The libc5 (Linus's libc)->libc6 (Ulrich Drepper's "glibc 2.0" fork,
http://freesoftwaremagazine.com/articles/history_of_glibc_and_linux_libc/)
transition happened around the same time, although that was driven by adding
thread support (because Java can't NOT use threads, because James Gosling
refused to wrap poll and select so the only way you can do nonblocking I/O in
Java is spawn a thread to block for you, which had BUCKETS of problems and the
linux kernel guys spent a good decade trying to whack-a-mole them. Threads were
invented because Solaris's fork() was very slow, as explained in the famous post
where Sun engineer Bryan Cantrill convinced the entire Linux community that Sun
was too dumb to live with one sentence:
https://landley.net/history/mirror/linux/kissedagirl.html .)

ELF was pretty much ubiquitous by Y2K, and Linux finally got around to
officially deprecating the old one in 2019:

https://www.linuxjournal.com/content/deprecating-aout-binaries

But they didn't remove binflt because not every nommu system has implemented
support for fdpic yet, because most embedded developers won't go near the
wretched hive of scum and villainy that is the linux kernel mailing list, and
half of them are still using the 2.6 kernel (or even 2.4, and in extreme cases
2.2 or 2.0) anyway. I've done a lot of work to try to make it POSSIBLE to use
current stuff in that space, but I'm up against the fact that an allnoconfig 2.2
kernel was like 200k and an allnoconfig 6.0 kernel is well over a megabyte, so...

If you want to know about the guts of ELF, the book "linkers and loaders" is
still definitive (and BONE DRY. I had to read it long ago and it was SO BORING).
I have a paper copy but the author made it available free online:

https://www.iecc.com/linker/

Although you might want the book in a more modern format:

https://github.com/last-genius/comp_arch_list/blob/master/books/Linkers%20and%20Loaders%20by%20John%20R.%20Levine.pdf

A subset of the horrific standards documents are at:

https://refspecs.linuxfoundation.org/

I had to read while trying to debug issues with a new dynamic linker
implementation for a new architecture. It's also useful if you're maintaining
your own tinycc fork (https://landley.net/hg/tinycc) or debugging the guts of
qemu (https://lists.nongnu.org/archive/html/qemu-devel/2010-02/msg00973.html).