[Toybox] Impact of global struct size

Rob Landley rob at landley.net
Mon Jan 1 12:45:43 PST 2024


On 12/30/23 18:10, Ray Gardner wrote:
> I am having a bit of trouble understanding the impact of globals.
> 
> There are the probes GLOBALS and findglobals to see what the space usage
> is for globals. The output of both show that "this" is up to 8232 bytes,
> due to the "ip" toy using 8232 of global space.

Which is in pending for a reason.

> The blog entry of 31 August 2023 ends with some discussion of which
> commands take up the most global space. It says "Everything "tr" and
> earlier is reasonably sized, and "ip" and "telnet" are in pending."

I sorted them by size and cut and pasted the end of the list, starting with "tr".

Commands in pending haven't been (fully) cleaned up yet, so problems in them
aren't yet confirmed to be a design problem requiring heavy lifting. Most likely
it's something easily fixable that I just haven't done the janitorial work on yet.

> I inferred that this means commands in pending are less important here,
> but they still seem to take up space in "this".

Only when enabled, which they aren't in defconfig. If you don't switch it on in
config, then it doesn't get built, meaning it doesn't take up any space in the
resulting toybox binary.

> How important is the space here? "tr" was 520 then, cksum was 1024. How
> big is too big?

Mostly this is an issue for embedded systems. I doubt android's going to care.

Sorry for the delay replying, I can't figure out how to explain this without a
GIANT INFODUMP of backstory. The tl;dr is your read-only data is shared between
all instances of the program, but your _writeable_ data needs a separate copy
for each instance of the program that's running, and that includes every global
variable you MIGHT write to. The physical pages can be demand-faulted on systems
with an MMU (although each fault gets rounded up to page size), but without an
mmu it's LD_BIND_NOW and then some. (See "man 8 ld.so" if you didn't get that
reference...)

Ok, backstory: since 1996 modern Linux executables are stored in ELF format
(Executable and Linkable Format, yes "ELF format" is like "ATM machine"). It's an
archive format like zip or tar, except what it stores is (among other things) a
list of "sections" each containing a list of "symbols". Your linker puts this
together from the .o files produced by the compiler.

Statically linked processes have six main memory mappings, four of which are
initialized by the kernel's ELF loader (fs/binfmt_elf.c) from sections in the
ELF file, and the other two are generated at runtime. All six of these are
created by the kernel during the execve(2) system call, mostly by code that
wanders through fs/binfmt_elf.c (or fs/binfmt_fdpic.c, which is kind of an
ext2/ext3 thing: MOSTLY the same with minor variants, and REALLY SHOULD be the
same file but isn't because kernel development became proctologically recursive
some years ago).

You can look at /proc/self/maps (and /proc/self/smaps, and
/proc/self/smaps_rollup) to see them for a running process (replace "self" with
any running PID, self is a symlink to your current PID). The six sections are:

  text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
  rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, PROT_READ)
  data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, PROT_WRITE)
  bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
  stack - function call stack, also contains environment data
  heap - backing store for malloc() and free()

The first three of those literally exist in the ELF file, as in it mmap()s a
block of data out of the file at a starting offset, and the memory is thus
automatically populated with data from the file. The text and rodata ones don't
really care if it's MAP_PRIVATE or MAP_SHARED because they can never write
anything back to the file, but the data one cares that it's MAP_PRIVATE: any
changes stay local and do NOT get written back to the file. And the bss is an
anonymous mapping so starts zeroed, the file doesn't bother wasting space on a
run of zeroes when the OS can just provide that on request. (It stands for Block
Started by Symbol, which I assume meant something useful 60+ years ago on IBM hardware.)

All four of those ELF sections (text, rodata, data, bss) are each treated as a
giant struct under the covers, because that's how C thinks. Every time you
reference a symbol that lives there, the C code goes "I have a pointer to the
start of this segment, I have an offset into it where this particular symbol
lives, and I know the type and thus size of the variable living at that offset".
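
In other words, something like this sketch (hypothetical names, using
offsetof() to make the normally hidden arithmetic visible):

  #include <stddef.h>
  #include <stdio.h>

  // Hypothetical layout: the linker treats the data segment as one big
  // struct, and each global is "type at offset" from the segment base.
  struct fake_data_segment {
    int counter;    // offset 0
    char name[16];  // offset 4
    long total;     // offset 24 (after alignment padding)
  };

  int main(void)
  {
    struct fake_data_segment seg = {0};
    char *base = (char *)&seg; // the "start of segment" pointer

    // what "counter = 7;" compiles down to: base + offset, typed store
    *(int *)(base + offsetof(struct fake_data_segment, counter)) = 7;
    printf("counter = %d\n", seg.counter);

    return 0;
  }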

The remaining two memory blocks aren't part of ELF, but they're needed at runtime.

The stack is also set up by the kernel, and is funny in three ways:

1) It has environment data at the end: all your inherited environment variables
and your argv[] arguments, plus an array of pointers to the start of each string
(which is what char *argv[] and char *environ[] actually point to). The kernel's
task struct also used to live there, but these days there's a separate "kernel
stack" and I'd have to look up where things physically are now and what's user
visible.

2) It grows down, not up. (Except on pa-risc, but that's essentially extinct.)
The pointer starts at the end of the stack space (well, right before the
environment data), each function call SUBTRACTS from it, and each function
return adds back to it.

Your local variables are basically ANOTHER struct where "I have a register
pointing to a location, and each name is an offset+type from that register"
(it's how C does everything). When you make a function call, the pointer moves
to fresh empty stack space so the local variables from last time aren't modified
by the current function, and when you return it moves back.

By moving down, a function can grab however much stack space it needs in one
place at the start of the function (I need X bytes, so sp -= X), and then that
"pointer+offset" logic is going into the protected space it's reserved for
itself. If the stack grew up, it would either have to SUBTRACT the offset to use
space it grabbed at the start, or else hold off to protect _this_ function's
space right before each function call (duplicated code, and also using
unprotected space so outside observers like debuggers would never _really_ know
how much of the stack was currently in use). The function can implement "return"
as shared code where it jumps to a single "add this many bytes back to the stack
pointer" instruction on the way out. Add each function call pushing the
instruction pointer to the stack before jumping to the new function, and then
you just "pop ip" after fixing the stack pointer and you've returned from your
function.
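
So a trivial (hypothetical) function like this compiles down to one subtract on
entry and one shared add on the way out:

  // Both locals become offsets from the moved stack pointer: one
  // "sp -= 16" on entry reserves them, one "sp += 16" before the final
  // "pop ip and jump" gives the space back. (Simplified sketch, real
  // compiler output varies by architecture and optimization level.)
  long sum3(long a, long b, long c)
  {
    long t1 = a + b;  // lives at sp+0
    long t2 = t1 + c; // lives at sp+8

    return t2;
  }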

3) The stack generally has _two_ pointers, a "stack pointer" and a "base
pointer", and I can never remember which is which. One of them stays put as a
fixed base for the current function's frame (so the offsets to local variables
don't change while the frame is live), and the other one moves (gets subtracted
from and added to as calls and returns happen).

All this is ignoring dynamic linking, in which case EACH library has those first
four sections (plus a PLT and GOT which have to nest since the shared libraries
are THEMSELVES dynamically linked, which is why you need to run ldd recursively
when harvesting binaries, although what it does to them at runtime I try not to
examine too closely after eating). There should still only be one stack and heap
shared by each process though.

If you launch dozens of instances of the same program, the read only sections
(text and rodata) are shared between all the instances. This is why nommu
systems needed to invent fdpic: in conventional ELF everything uses absolute
addresses, which is fine when you've got an MMU because each process has its own
virtual address range starting at zero. (Generally libc or something will mmap()
about 64k of "cannot read, cannot write, cannot execute" memory there so any
attempt to dereference a NULL pointer segfaults, but other than that...)

But shared libraries need to move so they can fit around stuff. Back in the
a.out days each shared library was also linked at an absolute address (just one
well above zero, out of the way of most programs), meaning when putting together
a system you needed a registry of what addresses were used by each library, and
you'd have to supply an address range to each library you were building as part
of the compiler options (or linker script or however that build did it). This
sucked tremendously.

So ELF PIC (Position Independent Code) was invented, where everything is offset
from a single pointer (all the segments are glued together one after the other,
so each segment has a "starting offset". Register + segment offset + variable
offset = location of variable). PIC was invented for shared libraries, but the
security nuts immediately decided they wanted executables to work like that too
(so exploit code that jumped to absolute addresses where it "knew" a function
lived would stop working. Moving shared libraries around helped, but moving the
executable's own code around nerfed more attack surface). So they invented PIE
(Position Independent Executables) which means your executable is compiled with
-fpic and then your startup code (crt0.o and friends) does some shared library
setup. (This may also have involved kernel changes, I knew back around 2007 but
between me forgetting and the plumbing being a moving target...)

(Note: legacy 32 bit x86 does not have enough registers, so anything that uses
one more register leaves less for the rest of the code resulting in extra stack
spills and such, although they were already shoving %fs and %gs or some such in
there decades ago and x86-64 added 8 more and x32 is a hybrid that can still use
those 8 extra in 32 bit mode... Anyway, there used to be strong reasons to skimp
on register usage, and these days not so much. Outside of legacy x86,
architectures with FEW registers have 16, and the rest have 32. So eating a
half-dozen for "API" is acceptable.)

So NOMMU systems: processors that do not have a memory management unit are
generally called "microcontrollers" instead of "microprocessors", and they use
physical addresses for everything. This has some advantages: not only is it less
circuitry (less power consumption), but it's more memory efficient because there
are no page tables (Vitaly Wool gave a talk at ELC in 2015 where he had Linux
running in 256k of SRAM, and everything else running out of ROM (well, NOR
flash), which was only possible because he didn't spend any memory on page
tables) and easier to get hard realtime behavior because there are no soft faults...

NOMMU systems have been analogized to bacteria, fungi, and insects: there are
more bacterial cells on the planet than mammal cells by a VERY LARGE MARGIN.
But they're mostly invisible to the elephants and dinosaurs of the world. NOMMU
systems are EVERYWHERE. Mostly Linux is too big to play in that space, but even
a small slice of it is more deployments than everything else Linux does combined
(yes including phones).

A MMU has an extra set of registers called a TLB (Translation Lookaside Buffer)
which translates each memory access (this memory address you're using is
actually THIS physical memory address). Each TLB entry is a starting address, a
range, and some permission flags (this entry is good for read, write, or execute
attempts). The TLB entries are all checked in parallel in a single clock cycle,
so there are never many of them (the wiring increases exponentially the more you
have).

When the code attempts a memory access that ISN'T in the TLB, the processor
generates an interrupt ("fault" to hardware people) and execution jumps to the
interrupt handler, which traverses a data structure called a "page table" to
figure out what that memory access should do. It can swap out one of the TLB
entries to stick in the appropriate virtual/physical translation and return from
the interrupt (this is called a "soft fault"). Or it can edit the page table to
allocate more physical memory (if this adds a zeroed page it's deferred
allocation, if it makes a copy of previously shared memory you're now writing to
it's copy-on-write). If it hasn't got enough memory to satisfy the attempted
access it can evict some other physical page (writing its contents to swap),
generally suspending the process and letting the scheduler pick something else
to run until the write's finished and it can reuse that page. Similarly, if that
page isn't available because it was swapped out (or is mmaped from a file but
not present in memory) the fault handler can schedule a load from disk (or
network filesystem!) and again suspend the process until that completes and the
data's ready to edit the page table, fix up the TLB, and resume from the fault.
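
The page table walk itself is conceptually simple. Here's a toy model of a
two-level lookup (pure simulation with made-up types; the real thing runs in
the interrupt handler and finishes by loading a TLB entry and retrying):

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12
  #define PAGE_MASK ((1<<PAGE_SHIFT)-1)

  struct pte { uint32_t pfn; int present, writeable; };

  // Walk a two-level page table: top 10 bits pick the second-level
  // table, next 10 bits pick the entry, low 12 bits are the offset
  // within the page. Returns the "physical" address, or -1 for a fault
  // the tables can't resolve (segfault, swap-in, copy-on-write...).
  long walk(struct pte **pgd, uint32_t vaddr, int is_write)
  {
    struct pte *table = pgd[vaddr>>22], *p;

    if (!table) return -1;
    p = table+((vaddr>>PAGE_SHIFT)&1023);
    if (!p->present || (is_write && !p->writeable)) return -1;

    // soft fault resolved: load pfn+flags into a TLB entry and retry
    return ((long)p->pfn<<PAGE_SHIFT)|(vaddr&PAGE_MASK);
  }

  int main(void)
  {
    static struct pte table[1024];
    struct pte *pgd[1024] = {table};

    table[3] = (struct pte){.pfn = 77, .present = 1};
    printf("%lx\n", walk(pgd, 0x3456, 0)); // 77<<12 | 0x456 = 4d456

    return 0;
  }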

Some old processors had page fault handling in hardware. This sucked rocks, and
Linus yelled at the PowerPC guys somewhat extensively about this 20 years ago.
Page fault handling in software is sane, page fault handling in hardware isn't.
This does mean you need to tie up a TLB entry pointing at your page table, and
another pointing at your memory fault handler code, so the software can run it.
(Or you stick the processor into "physical addressing" mode to bypass the TLB
during the interrupt so you're running without the MMU and all memory addresses
are physical addresses. Different processors do it different ways.)

A NOMMU system hasn't got a TLB. Sometimes it will have "range registers", which
are similar in that they provide windows into physical memory where access is
allowed. The same "start, length" logic for address matching, but all it does is
allow or disallow access. (That way your userspace processes can't access each
other's memory, and can't read or write kernel memory either.) But this isn't
very common because the MTRR plumbing is 90% of the way to a software TLB: add a
third value (an offset added to the physical address when this range matches;
remember unsigned values can wrap cleanly) and the RWX access type flags (so you
can match TYPE of access allowing read-only or non-executable memory), give the
fault handler the ability to resume the interrupted instruction after doing
fixups... and you have an MMU.

Things a NOMMU system _can't_ do:

1) Remap addresses.

On a system with an mmu, every process can have a different mapping at address
0x1000 and each sees its own unique contents there. Without an mmu, what's in
the memory is the same for everybody.

2) Collate discontiguous memory.

With an mmu, if I mmap(MAP_ANON) a megabyte, the underlying physical pages the
page table slots into that mapping may be scattered all over the place, and
it's not my problem.

Without an MMU, if I want a 256k mapping I need to find 256k of contiguous
physical memory (all together in one linear chunk). And if the system's been
running for a while, I may have way more than 256k free but it's in scattered
pieces none of which is big enough to satisfy my allocation request.

Fragmentation is a much bigger problem on NOMMU systems.

3) use swap space

It's built on top of page tables, and we haven't got those. So I can't use
storage to increase my available memory: the physical memory I've got is it.

4) Demand fault

If I allocate memory, I have to have that memory NOW. I can't allocate page
table space and leave the physical memory unpopulated, and then attach pages to
it from the fault handler as the processor actually tries to read or write to
the memory.

If I want to mmap() something backed from a file, I have to load it all in
during the mmap() call. I can't detect the access and load in just the part
that was touched from the fault handler.

I can't copy-on-write either. With an MMU, if I have a writeable mapping that
starts identical to another process's memory, I can point the new mapping at the
old one's physical pages and mark it read only, and then when they try to write
have the fault handler allocate a new physical page, copy the old data, and
redirect the write to the new page. But without an MMU, a separate writeable
mapping is separate physical pages right from the start, just initialized with
a copy of the data.

5) fork() processes.

An identical copy of this process would want to use the same addresses. If you
make copies of this process's mappings, they will live at different address
ranges, and all the pointers in the new copy will point into the OLD copy's memory.

Instead nommu systems have to use vfork(), which suspends the parent until the
child either calls exec() or _exit(), because both discard all the old mappings
(and thus the parent can get them back). Note that changes to global variables
in the child before calling exec() will affect the parent's data!

Often vfork() won't even give the child a new stack, which means values written
to _local_ variables are also visible in the parent when that resumes, AND it
means the child CANNOT RETURN FROM THE FUNCTION THAT CALLED VFORK (because that
will discard the current stack frame, and then the next function call will
overwrite it with nonsense, so when the parent resumes Bad Things Happen).
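
The safe pattern looks like this (a minimal sketch, error handling abbreviated):

  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
    int status;
    pid_t pid = vfork();

    if (pid == -1) return 1;
    if (!pid) {
      // Child: borrowing the parent's memory (possibly including its
      // stack), so the ONLY safe moves are exec...() or _exit().
      execlp("ls", "ls", "-l", (char *)0);
      _exit(127); // exec failed: _exit(), never exit() or return
    }

    // Parent was suspended until the child exec'd or exited.
    waitpid(pid, &status, 0);
    printf("child status %d\n", WEXITSTATUS(status));

    return 0;
  }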

6) Most mmap() flags don't work.

All mmap() really does on nommu is allocate physical memory ranges so other
mappings won't use them. There's no protection or reorganization.

You can mmap(MAP_ANONYMOUS). You can only MAP_FIXED if nobody else is using that
address yet. Everything is always MAP_LOCKED and MAP_POPULATE.

You can mmap(MAP_SHARED) but _not_ with PROT_WRITE (because it has no way to
detect that changed data needs to be written back: the MMU trick is to remove the
write bit from "clean" pages, take a fault the first time you try to write to
them, and let the write go through but schedule the page to be written out to
disk; there's optimizations with a second "dirty" bit to reduce the number of
faults taken without missing updates as the page is written to disk). And
remember, read-only mappings are fully populated up front, which is expensive
(and a latency spike because mmap() won't return until it's finished reading all
the data into memory).

You can't MAP_PRIVATE because that does copy-on-write when you change the
contents. You can mmap(MAP_SHARED) because "everybody sees the changes" is the
default, but only read-only mappings work.
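
So the one pattern that behaves the same with and without an MMU is the
read-only shared file mapping, something like (sketch, error handling
abbreviated):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
    struct stat st;
    int fd = open(argc>1 ? argv[1] : "/etc/hostname", O_RDONLY);

    if (fd == -1 || fstat(fd, &st)) return 1;

    // Read-only MAP_SHARED works everywhere: with an MMU the pages
    // fault in on first access, on nommu the whole thing is populated
    // before mmap() returns (hence the latency spike).
    char *map = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    if (map == MAP_FAILED) return 1;
    fwrite(map, 1, st.st_size, stdout);
    munmap(map, st.st_size);

    return close(fd);
  }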

7) Dynamically grow the stack.

Again, no faults for the interrupt handler to fix up via deferred physical page
allocation, so the stack size you request is the stack size the kernel allocates
for you in exec. And it's gotta be contiguous or the process can't START, so
ideally nommu systems use as small a stack size as possible. (This is another
extra field fdpic and bFLT added: bFLT was the old nommu variant of a.out,
it's obsolete, don't go there. You can set the stack size via a linker option,
or there's a default you can specify in the ./configure when you build your
toolchain.)

The default stack size on kernels with mmu is generally 8 megabytes. Common
stack sizes on nommu range from 4k to 128k. Once upon a time toybox could MOSTLY
be made to work with 16k, although it's been a while since I've regression
tested it and the shell ain't gonna be happy. I'd do more like 64k.

NOMMU also uses its own executable formats, so you have to compile and link your
binaries differently.

You can't run most normal ELF binaries on NOMMU, because those are linked to use
absolute addresses, and if those specific memory addresses aren't available
(because the kernel's there, for example) it won't load and can't run. You MIGHT
be able to run ONE ELF binary if you prepared it specially, but can't run even a
second copy of the same one (because it would want to re-use the same addresses).

You can run PIE binaries on NOMMU, but it's not ideal. Those glue all the ELF
segments together into a contiguous chunk of memory, but have a register
pointing to the starting address. So you can run the binary, assuming you can
find a big enough contiguous chunk of memory to stick it in. (The heap and stack
can fit anywhere, those always had a register pointing to the start of each,
even back on x86.) But you need a BIG contiguous chunk, which is hard to find
once memory gets fragmented. And it's wasteful because the read-only sections
(text and rodata) can't be shared, since they're contiguous with the writeable
sections. (There's only one "start of PIE" pointer, and all 4 ELF segments are
glued together after it because the offsets are calculated the way they are in
conventional ELF, just with the addition of a base pointer instead of always
starting at 0.)

FDPIC is an ELF variant that separates the segments, which means it uses 4
registers, one for each segment. (In theory you could use one register pointing
to an array of 4 pointers and have an extra dereference on every memory access,
but nobody wants to take the speed hit. That's why "fdpic for 32 bit x86" didn't
happen.) This means your read only sections CAN be shared, and that the other
writeable segments are independently moveable and can fit into smaller scraps of
fragmented memory.

SO: cycling back to the original question:

The GLOBALS() block is the majority of the data segment. That and the stack are
the two big contiguous allocations for each new process. (The heap is also there
but it happens later and there's stuff libc can do to make it suck less. In
theory EACH malloc() could be a separate mmap() which avoids fragmentation and
would also round up each allocation to 4k so is actually a huge net loss, but
the point is a nommu-aware allocator can break the heap up into multiple mmap()s
to mitigate fragmentation, and that can happen transparently within libc and not
be the userspace programmer's immediate problem.)

(Yes I said the kernel gives you a starting heap when I listed the 6 segments
above. It's a legacy thing from the PDP-11 days in the 1970s. "man 2 brk" is
related. I can explain, but... ask Rich Felker, he has a rant. I believe this
is another thing the libc author can shoot in the face and make it stop; for our
purposes you can probably ignore it. Modern allocation is more or less built on
mmap(), I think.)

> As long as "this" is as big as the largest GLOBAL struct, then what is the
> point of working to reduce the global space of any command, when the space
> for "ip" is in there whether "ip" is compiled into toybox or not?

The space for "ip" shouldn't be there when ip isn't compiled in. Looks like I
should add USE() macros around each struct in union global_data in
generated/globals.h.
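
The idea, as a freestanding sketch (stand-in structs and hand-rolled config
macros mimicking toybox's USE_COMMAND() convention, not actual generated output):

  #include <stdio.h>

  #define CFG_IP 0 // ip left in pending: configured out
  #define CFG_TR 1

  #if CFG_IP
  #define USE_IP(...) __VA_ARGS__
  #else
  #define USE_IP(...)
  #endif

  #if CFG_TR
  #define USE_TR(...) __VA_ARGS__
  #else
  #define USE_TR(...)
  #endif

  struct ip_data { char buf[8232]; };
  struct tr_data { char buf[520]; };

  union global_data {
    USE_IP(struct ip_data ip;)
    USE_TR(struct tr_data tr;)
  } this_; // toybox calls this global "this"

  int main(void)
  {
    // with ip configured out, the union shrinks to the biggest
    // enabled command's struct (520 here instead of 8232)
    printf("sizeof(this) = %zu\n", sizeof(this_));

    return 0;
  }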

> What am
> I missing? Why are global structs included in globals.h for commands not
> included in a build? Or are they somehow suppressed in the build?

On a system with mmu the pages are demand faulted so the main issue there is
memory usage being rounded up to 4k.

> The globals do not seem to affect the size of the executable file, at
> least using the default build. Is the issue with "this" taking up space at
> runtime? Many commands must surely allocate much more space dynamically
> anyway.
> 
> I ask because I have been spending effort to reduce global usage in a toy
> I'm working on, and did some rather large changes of static structs to
> pointer-to-struct to reduce global from 960 to 336, saving 624 global
> bytes, but the code size increased by 285 just due to accessing via
> pointers in many places.

Probably not a net win.

> I don't yet know if that has impacted performance
> noticeably. I am trying to understand if I should back out these changes
> before doing more work.

I err on the side of simple, and clean it up after the fact when I notice a
problem. Send the simple thing first.

Part of the reason it's good to have context is that if diff is using 456 bytes,
anything else trying to get its usage down UNDER that without first addressing
diff isn't really helping.

Every segment actually gets rounded up to multiples of 4k on all the linux
systems I'm aware of. (In theory there are 1k allocators, in practice most ext2
filesystems can't be mounted on them.) But "this" isn't the ONLY thing in there,
hence scripts/probes/findglobals:

$ scripts/probes/findglobals
D 8	toybox_version
B 80	toys
B 4096	libbuf
B 4096	toybuf
D 7648	toy_list
B 8232	this


Which is a filtered view of:

$ nm --size-sort generated/unstripped/toybox | grep '[0-9A-Fa-f]* [BCDGRS]'
0000000000000004 B __daylight@@GLIBC_2.2.5
0000000000000008 B __environ@@GLIBC_2.2.5
0000000000000008 B stderr@@GLIBC_2.2.5
0000000000000008 B stdin@@GLIBC_2.2.5
0000000000000008 B stdout@@GLIBC_2.2.5
0000000000000008 B __timezone@@GLIBC_2.2.5
0000000000000008 D toybox_version
0000000000000050 B toys
0000000000001000 B libbuf
0000000000001000 B toybuf
0000000000001de0 D toy_list
0000000000002028 B this

(But you'll notice all the symbols that got filtered out by findglobals' grep -v
GLIBC are little 4 and 8 byte BSS entries. So far, anyway, I dread each upgrade...)

So we have 7648 bytes of toy_list (just under 2 pages), plus 8 bytes of
toybox_version (I'm working to make this a normal string and thus rodata;
Elliott has a patch that moves it to rodata, but it's still a separate named
symbol and thus still CLUTTER that I have to EXPLAIN).

Anyway, 2 pages of data segment per process.

Meanwhile BSS is also a per-process contiguous memory allocation, so adding up:

4096+4096+8232 = 16424 bytes, which is just over 4 pages, so it needs 5
contiguous pages (20480 bytes), and the 80 bytes of "toys" plus the glibc debris
don't change that (they go a trivial amount into that extra page).

So to launch a new toybox instance on a nommu system, I _think_ I need:

1) 20480 bytes bss
2) 8192 bytes data
3) 32768 bytes stack
4) unknown heap

Each of which is a separate allocation, so the high water mark of contiguous
memory usage is still the stack, and everything else together roughly doubles
the stack. (Modulo what the code then DOES...)

Rob

