[Toybox] Impact of global struct size

Tue Jan 2 18:54:41 PST 2024

On Mon, Jan 1, 2024 at 12:39 PM Rob Landley <rob at landley.net> wrote:
>
> On 12/30/23 18:10, Ray Gardner wrote:
> > I am having a bit of trouble understanding the impact of globals.
> >
> > There are the probes GLOBALS and findglobals to see what the space usage
> > is for globals. The output of both show that "this" is up to 8232 bytes,
> > due to the "ip" toy using 8232 of global space.
>
> Which is in pending for a reason.
>
> > The blog entry of 31 August 2023 ends with some discussion of which
> > commands take up the most global space. It says "Everything "tr" and
> > earlier is reasonably sized, and "ip" and "telnet" are in pending."
>
> I sorted them by size and cut and pasted the end of the list, starting with "tr".
>
> Commands in pending haven't been (fully) cleaned up yet, so problems in them
> aren't yet confirmed to be a design problem requiring heavy lifting. Most likely
> something easily fixable I just haven't done the janitorial work for yet.
>
> > I inferred that this means commands in pending are less important here,
> > but they still seem to take up space in "this".
>
> Only when enabled, which they aren't in defconfig. If you don't switch it on in
> config, then it doesn't get build, meaning it doesn't take up any space in the
> resulting toybox binary.
>
> > How important is the space here? "tr" was 520 then, cksum was 1024. How
> > big is too big?
>
> Mostly this is an issue for embedded systems. I doubt android's going to care.

i'm the one who prods you once every year or so suggesting that 4KiB
for toybuf/libbuf is a bit on the small side :-)

> Sorry for the delay replying, I can't figure out how to explain this without a
> GIANT INFODUMP of backstory. The tl;dr is your read-only data is shared between
> all instances of the program, but your _writeable_ data needs a separate copy
> for each instance of the program that's running, and that includes every global
> variable you MIGHT write to. The physical pages can be demand-faulted on systems
> with an MMU (although each fault gets rounded up to page size), but without an
> mmu it's LD_BIND_NOW and then some. See "man 8 ld.so" if you didn't get that
> reference...)
>
> Ok, backstory: since 1996 modern Linux executables are stored in ELF format
> (Executable Linking Format, yes "ELF format" is like "ATM machine").

(no-one made you say "ELF format" ... you did that :-) i tend to say
"ELF file".)

> It's an
> archive format like zip or tar, except what it stores is (among other things) a
> list of "sections" each containing a list of "symbols". Your linker puts this
> together from the .o files produced by the compiler.
>
> Statically linked processes have six main memory mappings, four of which are
> initialized by the kernel's ELF loader (fs/binfmt_elf.c) from sections in the
> ELF file, and the other two are generated at runtime. All six of these are
> created by the kernel during the execve(2) system call, mostly l wanders through
> fs/binfmt_elf.c (or fs/binfmt_fdpic.c which is kind of an ext2/ext3 thing that's
> MOSTLY the same with minor variants and REALLY SHOULD be the same file but isn't
> because kernel development became proctologically recursive some years ago).
>
> You can look at /proc/self/maps (and /proc/self/smaps, and
> /proc/self/smaps_rollup) to see them for a running process (replace "self" with
> any running PID, self is a symlink to your current PID). The six sections are:
>
>   text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
>   rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, PROT_READ)
>   data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, PROT_WRITE)
>   bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
>   stack - function call stack, also contains environment data
>   heap - backing store for malloc() and free()

(Android and modern linux distros require the relro section too.
interestingly, there _is_ an elf program header for the stack, to
signal that you don't want an executable stack. iirc Android and [very
very recently] modern linux distros won't let you start a process with
an executable main stack, but afaik the code for the option no-one has
wanted/needed for a very long time is still in the kernel.)

> The first three of those literally exist in the ELF file, as in it mmap()s a
> block of data out of the file at a starting offset, and the memory is thus
> automatically populated with data from the file. The text and rodata ones don't
> really care if it's MAP_PRIVATE or MAP_SHARED because they can never write
> anything back to the file, but the data one cares that it's MAP_PRIVATE: any
> changes stay local and do NOT get written back to the file. And the bss is an
> anonymous mapping so starts zeroed, the file doesn't bother wasting space on a
> run of zeroes when the OS can just provide that on request. (It stands for Block
> Starting Symbol which I assume meant something useful 40 years ago on DEC hardware.)

(close, but it was IBM and the name was slightly different:
https://en.wikipedia.org/wiki/.bss#Origin)

> All four of those ELF sections (text, rodata, data, bss) are each treated as a
> giant struct under the covers, because that's how C thinks. Every time you
> reference a variable the C code goes "I have a pointer to the start of this, and
> I have an offset into it where this particular symbol lives within that segment,
> and I know the type and thus size of the variable living at that offset" every
> time you reference a symbol that lives there.
>
> The remaining two memory blocks aren't part of ELF, but they're needed at runtime.
>
> The stack is also set up by the kernel, and is funny in three ways:
>
> 1) it has environment data at the end (so all your inherited environment
> variables, and your argv[] arguments, plus an array of pointers to the start of
> each string which is what char *argv[] and char *environ[] actually point to.
> The kernel's task struct also used to live there, but these days there's a
> separate "kernel stack" and I'd have to look up where things physically are now
> and what's user visible.

(plus the confusingly named "ELF aux values", which come from the
kernel, and aren't really anything to do with ELF --- almost by
definition, since they're things that the binary _can't_ know like
"what's the actual page size of the system i'm _running_ on?" or
"what's the l1d cache size of the system i'm _running_ on?".)

> 2) It grows down, not up. (Except on pa-risc, but that's essentially extinct.)
> The pointer starts at the end of the stack space (well, right before the
> environment data), and each function call SUBTRACTS from it, and each function
> call adds back to it.
>
> Your local variables are basically ANOTHER struct where "I have a register
> pointing to a location, and each name is an offset+type from that register'
> (it's how C does everything). When you make a function call, the pointer moves
> to fresh empty stack space so the local variables from last time aren't modified
> by the current function, and then when you return it moves it back.
>
> By moving down, a function can grab however much stack space it needs in one
> place at the start of the function (I need X bytes, so sp -= X), and then that
> "pointer+offset" logic is going into the protected space it's reserved for
> itself. If the stack grew up, it would either have to SUBTRACT the offset to use
> space it grabbed at the start, or else hold off to protect _this_ function's
> space right before each function call (duplicated code, and also using
> unprotected space so outside observers like debuggers would never _really_ know
> how much of the stack was currently in use). The function can implement "return"
> as shared code where it jumps to a single "add this many bytes back to the stack
> pointer" instruction on the way out. Add each function call pushing the
> instruction pointer to the stack before jumping to the new function, and then
> you just "pop ip" after fixing the stack pointer and you've returned from your
> function.
>
> 3) The stack generally has _two_ pointers, a "stack pointer" and a "base
> pointer" which I always get confused. One of them points to the start of the
> mapping (kinda important to keep track of where your mappings are), and the
> other one moves (gets subtracted from and added to and offset to access local
> variables).

(s/base pointer/frame pointer/ for everything except x86. and actually
_both_ change. it's the "base" of the current stack _frame_, not the
whole stack. for a concrete example: alloca() changes the stack
pointer, but not the frame pointer. so local variables offsets
relative to fp will be constant throughout the function, whereas
offsets relative to sp can change. [stacked values of] fp is also what
you're using when you're unwinding.)

> All this is ignoring dynamic linking, in which case EACH library has those first
> four sections (plus a PLT and GOT which have to nest since the shared libraries
> are THEMSELVES dynamically linked, which is why you need to run ldd recursively
> when harvesting binaries, although what it does to them at runtime I try not to
> examine too closely after eating). There should still only be one stack and heap
> shared by each process though.

(one stack _per thread_ in the process. and the main thread stack is
very different from thread stacks.)

> If you launch dozens of instances of the same program, the read only sections
> (text and rodata) are shared between all the instances. (This is why nommu
> systems needed to invent fdpic: in conventional ELF everything uses absolute
> addresses, which is find when you've got an MMU because each process has its own
> virtual address range starting at zero. (Generally libc or something will mmap()
> about 64k of "cannot read, cannot write, cannot execute" memory there so any
> attempt to dereference a NULL pointer segfaults, but other than that...)
>
> But shared libraries need to move so they can fit around stuff. Back in the
> a.out days each shared library was also linked at an absolute address (just one
> well above zero, out of the way of most programs), meaning when putting together
> a system you needed a registry of what addresses were used by each library, and
> you'd have to supply an address range to each library you were building as part
> of the compiler options (or linker script or however that build did it). This
> sucked tremendously.

(funnily enough, this gets reinvented as an optimization every couple
of decades. iirc macOS has "prelinking" again, but Android is
currently in the no-prelinking phase of the cycle.)

> So ELF PIC (Position Independent Code) was invented,

(PIC long predates shared libraries:
https://en.wikipedia.org/wiki/Position-independent_code#History )

> where everything is offset
> from a single pointer (all the segments are glued together one after the other,
> so each segment has a "starting offset". Register + segment offset + variable
> offset = location of variable). PIC was invented for shared libraries, but the
> security nuts immediately decided they wanted executables to work like that too
> (so exploit code that jumped to absolute addresses where it "knew" a function
> lived would stop working. Moving shared libraries around helped, but moving the
> executable's own code around nerfed more attack surface). So they invented PIE
> (Position Independent Executables) which means your executable is compiled with
> -fpic and then your startup code (crt0.o and friends) does some shared library
> setup. (This may also have involved kernel changes, I knew back around 2007 but
> between me forgetting and the plumbing being a moving target...)
>
> (Note: legacy 32 bit x86 does not have enough registers, so anything that uses
> one more register leaves less for the rest of the code resulting in extra stack
> spills and such, although they were already shoving %fs and %gs or some such in
> there decades ago and x86-64 added 8 more and x32 is a hybrid that can still use
> those 8 extra in 32 bit mode... Anyway, there used to be strong reasons to skimp
> on register usage, and these days not so much. Outside of legacy x86,
> architectures with FEW registers have 16, and the rest have 32. So eating a
> half-dozen for "API" is acceptable.)

(32-bit x86 PIC is almost maximally pessimal: https://ewontfix.com/18/
funnily enough, Android doesn't support lazy binding, so there was no
reason for Android's 32-bit x86 ABI to be this awful, but i don't
think anyone was paying attention at the time. even intel was already
only really concerned with x86-64 where it's only somewhat ugly.)

> So NOMMU systems: processors that do not have a memory management unit are
> generally called "microcontrollers" instead of "microprocessors", and they use
> physical addresses for everything. This has some advantages: not only is it less
> circuitry (less power consumption), but it's more memory efficient because there
> are no page tables (Vitaly Wool gave a talk at ELC in 2015 where he had Linux
> running in 256k of SRAM, and everything else running out of ROM (well, nor
> flash), which was only possible because he didn't spend any memory on page
> tables) and easier to get hard realtime behavior because there are no soft faults...
>
> NOMMU systems have been analogized to bacteria, fungi, and insects: there are
> more bacterial cells on the planet than mammal cells by a VERY LARGER MARGIN.
> But they're mostly invisible to the elephants and dinosaurs of the world. NOMMU
> systems are EVERYWHERE. Mostly Linux is too big to play in that space, but even
> a small slice of it is more deployments than everything else Linux does combined
> (yes including phones).
>
> A MMU has an extra set of registers called a TLB (Translation Lookaside Buffer)
> which translates each memory access (this memory address you're using is
> actually THIS physical memory address). Each TLB entry is a starting address, a
> range, and some permission flags (this entry is good for read, write, or execute
> attempts). The TLB entries are all checked in parallel in a single clock cycle,
> so there are never many of them (the wiring increases exponentially the more you
> have). When a memory access the code tries to do ISN'T in the TLB, the processor
> generates an interrupt ("fault" to hardware people) and execution jumps to the
> interrupt handler, which traverses a data structure called a "page table" to
> figure out what that memory access should do. It can swap out one of the TLB
> entries to stick in the appropriate virtual/physical translation and return from
> the interrupt (this is called a "soft fault"). Or it can edit the page table to
> allocate more physical memory (if this adds a zeroed page it's deferred
> allocation, if it makes a copy of previously shared memory you're now writing to
> it's copy-on-write). If it hasn't got enough memory to satisfy the attempted
> access it can evict some other physical page (writing its contents to swap),
> generally suspending the process and letting the scheduler pick something else
> to run until the write's finished and it can reuse that page. Similarly, if that
> page isn't available because it was swapped out (or is mmaped from a file but
> not present in memory) the fault handler can schedule a load from disk (or
> network filesystem!) and again suspend the process until that completes and the
> data's ready to edit the page table, fix up the TLB, and resume from the fault.
>
> Some old processors had page fault handling in hardware. This sucked rocks, and
> Linus yelled at the PowerPC guys somewhat extensively about this 20 years ago.
> Page fault handling in software is sane, page fault handling in hardware isn't.
> This does mean you need to tie up a TLB entry pointing at your page table, and
> another pointing at your memory fault handler code, so the software can run it.
> (Or you stick the processor into "physical addressing" mode to bypass the TLB
> during the interrupt so you're running without the MMU and all memory addresses
> are physical addresses. Different processors do it different ways.)
>
> A NOMMU system hasn't got a TLB. Sometimes it will have "range registers", which
> are similar in that they provide windows into physical memory where access is
> allowed. The same "start, length" logic for address matching, but all it does is
> allow or disallow access. (That way your userspace processes can't access each
> other's memory, and cant' read or write kernel memory either.) But this isn't
> very common because the MTRR plumbing is 90% of the way to a software TLB: add a
> third value (an offset added to the physical address when this range matches;
> remember unsigned values can wrap cleanly) and the RWX access type flags (so you
> can match TYPE of access allowing read-only or non-executable memory), give the
> fault handler the ability to resume the interrupted instruction after doing
> fixups... and you have an MMU.
>
> Things a NOMMU system _can't_ do:
>
> 1) Remap addresses.
>
> On a system with an mmu, every process can have a different mapping at address
> 0x1000 and each sees its own unique contents there. Without an mmu, what's in
> the memory is the same for everybody.
>
> 2) Collate discontiguous memory.
>
> With an mmu, if I mmap(MAP_ANON) a megabyte long, the underlying physical pages
> the page table slots in may be scattered all over the place, and it's not my
> problem.
>
> Without an MMU, if I want a 256k mapping I need to find 256k of contiguous
> physical memory (all together in one linear chunk). And if the system's been
> running for a while, I may have way more than 256k free but it's in scattered
> pieces none of which is big enough to satisfy my allocation request.
>
> Fragmentation is a much bigger problem on NOMMU systems.
>
> 3) use swap space
>
> It's built on top of page tables, and we haven't got those. So I can't use
> storage to increase my available memory: the physical memory I've got is it.
>
> 4) Demand fault
>
> If I allocate memory, I have to have that memory NOW. I can't allocate page
> table space and leave the physical memory unpopulated, and then attach pages to
> it from the fault handler as the processor actually tries to read or write to
> the memory.
>
> If I want to mmap() something backed from a file, I have to load it all in
> during the mmap() call. I can't detect access and load it in just the part that
> was accessed from the fault handler.
>
> I can't copy-on-write either. With an MMU, I have a writeable mapping that
> starts identical to another process's memory I can point the new mapping at the
> old one's physical pages and mark it read only, and then when they try to write
> have the fault handler allocate a new physical page, copy the old data, and
> redirect the write to the new page. But without an MMU, a separate writeable
> mapping is separate physical pages right from the start, it just initializes it
> with a copy of the data.
>
> 5) fork() processes.
>
> An identical copy of this process would want to use the same addresses. If you
> make copies of this process's mappings, they will live at different address
> ranges. All the pointers in the new copy point into to the OLD copy's memory.
>
> Instead nommu systems have to use vfork(), which suspends the parent until the
> child either calls exec() or _exit(), because both discard all the old mappings
> (and thus the parent can get them back). Note that changes to global variables
> in the child before calling exec() will affect the parent's data!
>
> Often vfork() won't even give the child a new stack, which means values written
> to _local_ variables are also visible in the parent when that resumes, AND it
> means the child CANNOT RETURN FROM THE FUNCTION THAT CALLED VFORK (because that
> will discard the current stack frame, and then the next function call will
> overwrite it with nonsense, so when the parent resumes Bad Things Happen).

(that's a good trivia question that most libcs got wrong until fairly
recently: "what three libc functions need to be annotated as
returns_twice?" :-) )

> 6) Most mmap() flags don't work.
>
> All mmap() really does on nommu is allocate phyiscal memory ranges so other
> mappings won't use it. There's no protection or reorganization.
>
> You can mmap(MAP_ANONYMOUS). You can only MAP_FIXED if nobody else is using that
> address yet. Everything is always MAP_LOCKED and MAP_POPULATE.
>
> You can mmap(MAP_SHARED) but _not_ with PROT_WRITE (because it has no way to
> detect changed data needs to be written back: the MMU trick is to remove the
> write bit from "clean" pages, take a fault the first time you try to write to
> them, and let the write go through but schedule the page to be written out to
> disk; there's optimizations with a second "dirty" bit to reduce the number of
> faults taken without missing updates as the page is written to disk). And
> remember, read-only mappings are fully populated up front, which is expensive
> (and a latency spike because mmap() won't return until it's finished reading all
> the data into memory).
>
> You can't MAP_PRIVATE because that does copy-on-write when you change the
> contents. You can mmap(MAP_SHARED) because "everybody sees the changes" is the
> default, but only read-only mappings work.
>
> 7) Dynamically grow the stack.
>
> Again, no faults for the interrupt handler to fix up via deferred physical page
> allocation, so the stack size you request is the stack size the kernel allocates
> for you in exec. And it's gotta be contiguous or the process can't START, so
> ideally nommu systems use as small a stack size as possible. (This is another
> extra field fdpic and binflt added: binflt as the old nommu variant of a.out,
> it's obsolete, don't go there. You can set it via a linker option, or it's got a
> default you can specify in the ./configure when you build your toolchain.)
>
> The default stack size on kernels with mmu is generally 8 megabytes. Common
> stack sizes on nommu range from 4k to 128k. Once upon a time toybox could MOSTLY
> be made to work with 16k, although it's been a while since I've regression
> tested it and the shell ain't gonna be happy. I'd do more like 64k.
>
> NOMMU also uses its own executable formats, so you have to compile and link your
> binaries differently.
>
> You can't run most normal ELF binaries on NOMMU, because those are linked to use
> absolute addresses, and if those specific memory addresses aren't available
> (because the kernel's there, for example) it won't load and can't run. You MIGHT
> be able to run ONE ELF binary if you prepared it specially, but can't run even a
> second copy of the same one (because it would want to re-use the same addresses).
>
> You can run PIE binairies on NOMMU, but it's not ideal. Those glue all the ELF
> segments together into a contiguous chunk of memory, but have a register
> pointing to the starting address. So you can run the binary, assuming you can
> find a big enough contiguous chunk of memory to stick it in. (The heap and stack
> can fit anywhere, those always had a register pointing to the start of each,
> even back on x86.) But you need a BIG contiguous chunk, which is hard to find
> once memory gets fragmented. And it's wasteful because the read-only sections
> (text and rodata) can't be shared, since they're contiguous with the writeable
> sections. (There's only one "start of PIE" pointer, and all 4 ELF segments are
> glued together after it because the offsets are calculated the way they are in
> conventional ELF, just with the addition of a base pointer instead of always
> starting at 0.)
>
> FDPIC is an ELF variant that separates the segments, which means it uses 4
> registers, one for each segment. (In theory you could use one register pointing
> to an array of 4 pointers and have an extra dereference on every memory access,
> but nobody wants to take the speed hit. That's why "fdpic for 32 bit x86" didn't
> happen.) This means your read only sections CAN be shared, and that the other
> writeable segments are independently moveable and can fit into smaller scraps of
> fragmented memory.
>
> SO: cycling back to the original question:
>
> The GLOBALS() block is the majority of the data segment. That and the stack are
> the two big contiguous allocations for each new process. (The heap is also there
> but it's happens later and there's stuff libc can do to make it suck less. In
> theory EACH malloc() could be a separate mmap() which avoids fragmentation and
> would also round up each allocation to 4k so is actually a huge net loss, but
> the point is a nommu-aware allocator can break the heap up into multiple mmap()s
> to mitigate fragmentation, and that can happen transparently within libc and not
> be the userspace programmer's immediate problem.)

(regular allocators do this already. especially for "large"
allocations [with the main distinction being what any given allocator
considers "large"]. afaik, only the really ancient brk()-style
allocators didn't do this. "i haven't worked on an allocator that
didn't". strictly, dlmalloc _can_ work in just the old brk() style,
but even in the 1990s they recommended against that.)

> (Yes I said the kernel gives you a starting heap when I listed the 6 segments
> above. It's a legacy thing from the PDP-11 days in the 1970s. "man 2 brk" is
> related. I can explain, but... ask Rich Felker. I he has a rant. I believe this
> is another thing the libc author can shoot in the face and make it stop; for our
> purposes you can probably ignore it. Modern allocation is more or less built on
> mmap(), I think.)

(yes, to the extent that we _talked_ about disallowing the brk()
syscall via seccomp. looks like we never did though. i should file a
bug about that...)

> > As long as "this" is as big as the largest GLOBAL struct, then what is the
> > point of working to reduce the global space of any command, when the space
> > for "ip" is in there whether "ip" is compiled into toybox or not?
>
> The space for "ip" shouldn't be there when ip isn't compiled in. Looks like I
> should add USE() macros around each struct in union global_data in
> generated/globals.h.
>
> > What am
> > I missing? Why are global structs included in globals.h for commands not
> > included in a build? Or are they somehow suppressed in the build?
>
> On a system with mmu the pages are demand faulted so the main issue there is
> memory usage being rounded up to 4k.
>
> > The globals do not seem to affect the size of the executable file, at
> > least using the default build. Is the issue with "this" taking up space at
> > runtime? Many commands must surely allocate much more space dynamically
> > anyway.
> >
> > I ask because I have been spending effort to reduce global usage in a toy
> > I'm working on, and did some rather large changes of static structs to
> > pointer-to-struct to reduce global from 960 to 336, saving 624 global
> > bytes, but the code size increased by 285 just due to accessing via
> > pointers in many places.
>
> Probably not a net win.
>
> > I don't yet know if that has impacted performance
> > noticeably. I am trying to understand if I should back out these changes
> > before doing more work.
>
> I err on the side of simple, and clean it up after the fact when I notice a
> problem. Send the simple thing first.
>
> Part of the reason it's good to have context is if diff is using 456 bytes,
> anything else trying to get its usage down UNDER that without first addressing
> diff isn't really helping.)
>
> Every segment actually gets rounded up to multiples of 4k on all the linux
> systems I'm aware of. (In theory there are 1k allocators, in practice most ext2
> filesystems can't be mounted on them.) But "this" isn't the ONLY thing in there,
> hence scripts/probes/findglobals:
>
> $ scripts/probes/findglobals
> D 8     toybox_version
> B 80    toys
> B 4096  libbuf
> B 4096  toybuf
> D 7648  toy_list
> B 8232  this
>
>
> Which is a filtered view of:
>
> $ nm --size-sort generated/unstripped/toybox | grep '[0-9A-Fa-f]* [BCDGRS]'
> 0000000000000004 B __daylight@@GLIBC_2.2.5
> 0000000000000008 B __environ@@GLIBC_2.2.5
> 0000000000000008 B stderr@@GLIBC_2.2.5
> 0000000000000008 B stdin@@GLIBC_2.2.5
> 0000000000000008 B stdout@@GLIBC_2.2.5
> 0000000000000008 B __timezone@@GLIBC_2.2.5
> 0000000000000008 D toybox_version
> 0000000000000050 B toys
> 0000000000001000 B libbuf
> 0000000000001000 B toybuf
> 0000000000001de0 D toy_list
> 0000000000002028 B this
>
> (But you'll notice all the symbols that got filtered out by the grep -v GLIBC
> are 8k BSS entries. So far, anyway, I dread each upgrade...)

(isn't that _bytes_? 0x1000 == 4096 makes toybuf right. daylight
should be an int, and the others should just be pointers, so one 4 and
a bunch of 8 [byte]s sounds right.)

> So we have 7648 bytes toy_list (just under 2 pages), plus 8 bytes toy_version
> (I'm working to make this a normal string and thus rodata, Elliott has a patch
> that moves it to rodata but it's still a separate named symbol and thus still
> CLUTTER that I have to EXPLAIN.)

yeah, i hadn't understood from your blog rant that it was a _tidiness_
thing ... with my "i'm not actually a security person, but i do
sometimes play one on the internet" hat on, i'm usually worrying about
things that don't need to be writeable and/or executable and removing
that protection bit. if i cared about every last symbol, the size of
Android's system image would have sent me over the edge years ago :-)

> Anyway, 2 pages of data segment per process.
>
> Meanwhile BSS is also a per-process contiguous memory allocation, so adding up:
>
> 4096+4096+8232 is just over 4 pages so need 5 contiguous pages (20480 bytes),
> and the 80 bytes of "toys" + the glibc debris doesn't change that (goes a
> trivial amount into that extra page).
>
> So to launch a new toybox instance on a nommu system, I _think_ I need:
>
> 1) 20480 bytes
> 2) 8192 bytes data
> 3) 32768 bytes stack.
> 4) unknown heap.
>
> Each of which is a separate allocation, so the high water mark of contiguous
> memory usage is still the stack, and everything else together roughly doubles
> the stack. (Modulo what the code then DOES...)
>
> Rob
> _______________________________________________
> Toybox mailing list
> Toybox at lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net