[Toybox] Impact of global struct size

Wed Jan 3 13:52:59 PST 2024

On Thu, Jan 4, 2024 at 4:30 AM Rob Landley <rob at landley.net> wrote:
>
> I note that I've written over a hundred lines of rant in response to his
> previous email already. I should dig back through this and turn it into proper
> documentation at some point. (Especially since Elliott knows more of this stuff
> than I do so I'm likely to get corrected a lot here...)
>
> On 1/2/24 20:54, enh wrote:
> >> You can look at /proc/self/maps (and /proc/self/smaps, and
> >> /proc/self/smaps_rollup) to see them for a running process (replace "self" with
> >> any running PID, self is a symlink to your current PID). The six sections are:
> >>
> >>   text - the executable functions: mmap(MAP_PRIVATE, PROT_READ|PROT_EXEC)
> >>   rodata - const globals, string constants, etc: mmap(MAP_PRIVATE, PROT_READ)
> >>   data - writeable data initialized to nonzero: mmap(MAP_PRIVATE, PROT_WRITE)
> >>   bss - writeable data initialized to zero: mmap(MAP_ANON, PROT_WRITE)
> >>   stack - function call stack, also contains environment data
> >>   heap - backing store for malloc() and free()
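
A rough sketch of how the section list above can be matched against /proc/self/maps. The names parse_maps and guess_kind are made up for illustration, and the classification is only a heuristic based on the permission bits: relro pages and the ELF header page also show up as read-only file mappings, and anonymous mappings cover more than just bss.

```python
from collections import namedtuple

# One line of /proc/<pid>/maps, reduced to the fields used here.
Mapping = namedtuple("Mapping", "start end perms path")

def parse_maps(text):
    """Parse the text of /proc/<pid>/maps into a list of Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], path))
    return mappings

def guess_kind(m):
    """Heuristic label for a mapping, based on its name and permissions."""
    if m.path.startswith("["):
        return m.path.strip("[]")   # [stack], [heap], [vdso], [vvar], ...
    if not m.path:
        return "anon"               # bss, malloc arenas, thread stacks, ...
    if "x" in m.perms:
        return "text"
    if "w" in m.perms:
        return "data"
    return "rodata"
```

On Linux, something like `for m in parse_maps(open("/proc/self/maps").read()): print(hex(m.start), m.perms, guess_kind(m), m.path)` prints one labelled line per mapping.
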
> >
> > (Android and modern linux distros require the relro section too.
>
> I thought that was only needed for dynamic linking? Then again you don't allow a
> lot of static stuff to run on the final system anyway...
>
> (The line between PIE and dynamic linking confuses even me. How does static PIE
> relocate itself? I _think_ I looked it up once, but "it statically links in a
> dynamic linker in the pile of crt1.o and begin.o files" _can't_ be right...)
>
> > interestingly, there _is_ an elf program header for the stack, to
> > signal that you don't want an executable stack. iirc Android and [very
> > very recently] modern linux distros won't let you start a process with
> > an executable main stack, but afaik the code for the option no-one has
> > wanted/needed for a very long time is still in the kernel.)
>
> Cool.
>
> These days there's also vdso and vvar, which are provided by the kernel at
> runtime. The first is a .text section with magic functions you can call as an
> alternative to syscalls, and the second is a magic .rodata section that provides
> volatile variables the OS updates which you can just reach out and look at.
>
> Between the two of them you can do things like check the current timestamp
> without a system call. What they actually provide varies by OS (and then your
> libc has to be taught to use each new capability out of there instead of making
> the syscalls).
>
> "cat /proc/self/maps" and they're the last two entries if present.
>
> There is a "man 7 vdso" but I dunno how up to date it is. (Which gets us back to
> Michael Kerrisk's retirement and the new guy NOT MAINTAINING A WEB COPY. Grrr.)
>
> Maintaining backwards compatibility means keeping a lot of old stuff. I had a
> talk with Rich Felker last night on IRC about what musl-libc's syscall
> requirements actually _are_, and what it would take to repot it on top of a
> posix-ish RTOS du jour. (Makes the trusting trust cleansing cycle smaller if you
> can cross compile Linux from an RTOS...)

I did the "run linux-musl binaries on an RTOS" part a few years ago
and ended up with this list:

https://github.com/apexrtos/apex/blob/master/sys/kern/syscall_table.c

It's by no means exhaustive, but it was enough to run a useful set of
toybox toys, busybox's ash and enough other stuff to build a
commercial product running on an armv7-m (nommu) chip on top of it. I
had a risc-v port working and was in the middle of getting powerpc
(mmu) stuff running when circumstances changed and I had to move on.

I'm not sure how many more syscalls would be required to be able to
compile Linux, but probably not a whole lot.

Patrick

> We didn't come to a conclusion, but I _did_ get permission from skarnet to use
> his git://git.skarnet.org/mdevd under 0BSD. (Porting that to toybox seems easier
> than bringing my old mdev code up to speed for all the
> https://github.com/slashbeast/mdev-like-a-boss stuff it's grown since I handed
> it off.)
>
> >> The first three of those literally exist in the ELF file, as in it mmap()s a
> >> block of data out of the file at a starting offset, and the memory is thus
> >> automatically populated with data from the file. The text and rodata ones don't
> >> really care if it's MAP_PRIVATE or MAP_SHARED because they can never write
> >> anything back to the file, but the data one cares that it's MAP_PRIVATE: any
> >> changes stay local and do NOT get written back to the file. And the bss is an
> >> anonymous mapping so starts zeroed, the file doesn't bother wasting space on a
> >> run of zeroes when the OS can just provide that on request. (It stands for Block
> >> Starting Symbol which I assume meant something useful 40 years ago on DEC hardware.)
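
The MAP_PRIVATE behaviour described above can be seen from Python, as a minimal sketch: on Unix, mmap.ACCESS_COPY requests a private, writable (copy-on-write) mapping. The function name and the temp file are just for illustration.

```python
import mmap
import os
import tempfile

def private_write_demo(original=b"hello"):
    """Write through a private mapping and show the file is unchanged."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, original)
        # ACCESS_COPY: writes affect memory only, never the underlying file.
        m = mmap.mmap(fd, 0, access=mmap.ACCESS_COPY)
        m[0:1] = b"J"
        in_memory = bytes(m[:])
        m.close()
        with open(path, "rb") as f:
            on_disk = f.read()
    finally:
        os.close(fd)
        os.unlink(path)
    return in_memory, on_disk
```

The first element of the result is what the process sees after the write, the second is what is still in the file.
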
> >
> > (close, but it was IBM and the name was slightly different:
> > https://en.wikipedia.org/wiki/.bss#Origin)
>
> That says United Aircraft Corporation named it using IBM 704 hardware in an
> assembler and then in fortran. (I only give wikipedia[citation needed] about an
> 80% chance to be accurate about any given fact, but am not root causing it right
> now. :)
>
> I like to track down magic acronyms, ala grep meaning "get regular expression".
> I once emailed Dennis Ritchie to ask what "inode" meant:
>
> https://lkml.iu.edu/hypermail/linux/kernel/0207.2/1182.html
>
> But in this case I stopped paying attention once I confirmed it doesn't mean
> anything of modern relevance.
>
> The interesting part (to me) is that the name predates unix by almost 20 years
> (mainframe legacy predating even the PDP-1), and predates ELF by 40 years. (The
> first OS with ELF binaries was Solaris 2.0 released in 1992. Linux switched over
> 3-4 years later.)
>
> If it wasn't a legacy acronym from shortly after world war II, it would probably
> be called something like the "zero section" and we wouldn't have to memorize
> what it means. :)
>
> >> The stack is also set up by the kernel, and is funny in three ways:
> >>
> >> 1) it has environment data at the end (so all your inherited environment
> >> variables, and your argv[] arguments, plus an array of pointers to the start of
> >> each string which is what char *argv[] and char *environ[] actually point to).
> >> The kernel's task struct also used to live there, but these days there's a
> >> separate "kernel stack" and I'd have to look up where things physically are now
> >> and what's user visible.
> >
> > (plus the confusingly named "ELF aux values", which come from the
> > kernel, and aren't really anything to do with ELF --- almost by
> > definition, since they're things that the binary _can't_ know like
> > "what's the actual page size of the system i'm _running_ on?" or
> > "what's the l1d cache size of the system i'm _running_ on?".)
>
> Are they in the stack? I know the pointer is passed to _start() (often not in a
> proper argument, in a REGISTER), but hadn't tracked down where it actually
> lived. Stack makes sense...
>
> Sadly, I have had to care about the auxiliary vector on far too many occasions:
>
> man 3 getauxval
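
For poking at the auxiliary vector without writing C, Linux also exposes the raw entries in /proc/self/auxv as pairs of native unsigned longs ending in an AT_NULL entry. A minimal decoding sketch follows; parse_auxv is a made-up name, and it assumes the interpreter has the same word size as the process whose auxv it reads (true for "self").

```python
import struct

AT_NULL = 0     # terminator entry
AT_PAGESZ = 6   # page size of the system the process is running on

def parse_auxv(data):
    """Decode raw auxv bytes (native unsigned long pairs) into a dict."""
    entries = {}
    for key, value in struct.iter_unpack("@LL", data):
        if key == AT_NULL:
            break
        entries[key] = value
    return entries
```

Something like `parse_auxv(open("/proc/self/auxv", "rb").read()).get(AT_PAGESZ)` should then match what getauxval(AT_PAGESZ) returns in C.
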
>
> >> 3) The stack generally has _two_ pointers, a "stack pointer" and a "base
> >> pointer" which I always get confused. One of them points to the start of the
> >> mapping (kinda important to keep track of where your mappings are), and the
> >> other one moves (gets subtracted from and added to and offset to access local
> >> variables).
> >
> > (s/base pointer/frame pointer/ for everything except x86. and actually
> > _both_ change. it's the "base" of the current stack _frame_, not the
> > whole stack. for a concrete example: alloca() changes the stack
> > pointer, but not the frame pointer. so local variable offsets
> > relative to fp will be constant throughout the function, whereas
> > offsets relative to sp can change. [stacked values of] fp is also what
> > you're using when you're unwinding.)
>
> I only implemented alloca() for my tinycc fork on 32-bit x86, and that was back
> in 2008.
>
> I'm hoping to sit in on tonight's https://meet.jit.si/golug at 6pm about creating a
> compiler with a recursive descent parser, and someday hope to read
> https://norasandler.com/2017/11/29/Write-a-Compiler.html and the corresponding
> https://nostarch.com/writing-c-compiler and https://github.com/nlsandler/nqcc
> but right now restarting my https://landley.net/code/qcc is not even on the back
> burner...
>
> >> All this is ignoring dynamic linking, in which case EACH library has those first
> >> four sections (plus a PLT and GOT which have to nest since the shared libraries
> >> are THEMSELVES dynamically linked, which is why you need to run ldd recursively
> >> when harvesting binaries, although what it does to them at runtime I try not to
> >> examine too closely after eating). There should still only be one stack and heap
> >> shared by each process though.
> >
> > (one stack _per thread_ in the process. and the main thread stack is
> > very different from thread stacks.)
>
> A thread is a process with brain damage inherited from solaris' limitations, but
> you're right. I just mentally gloss over threads as "process with training
> wheels and 5x the debugging effort".
>
> Even before the ~7 year period where I thought java was a good idea, I had to
> use threading VERY EXTENSIVELY on OS/2. The "workplace shell" desktop was a
> single process with many, many threads, so any desktop programming there meant
> creating a shared library the workplace shell process would dlopen() and launch
> threads for. I got very, very good at debugging thread issues, once upon a time.
> (And I've debugged a lot of OTHER people's threading issues as a consultant. The
> oil exploration company that bought three different programs and mushed them
> together into a single highly threaded process that leaked like a sieve and
> segfaulted randomly. The 2018 project that replaced WinCE with Linux when
> microsoft end-of-lifed wince, resulting in an 80 thread application process,
> half of which were C# code running in mono and the other half were linux native
> code sharing the same address space, and the PROBLEM was on the ~200 mhz
> deployment hardware they had a warehouse full of and wanted to keep selling,
> fork() caused a 75 millisecond latency spike in EVERY OTHER THREAD because the
> kernel took one look at that mess and locked the whole vma until fork() had
> finished copying everything, which meant a thread spawning a child process
> caused the token-ring-like bus to timeout and drop connection. Which meant I got
> to do a real world use of vfork() on a system with an MMU, because that only
> suspends the PARENT thread, not all the other threads in the process, and
> vfork()/exec() isn't that much harder to program around than fork()/exec().)
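
Python has no vfork(), so the nearest stdlib analogue to the vfork()/exec() pattern above is os.posix_spawn(), which creates the child and execs in one call instead of copying the parent with fork() first. Whether the C library underneath actually uses a vfork-style clone is an assumption about the libc in use, not something this sketch checks. spawn_and_wait is a made-up helper name.

```python
import os

def spawn_and_wait(argv):
    """Start argv[0] without an explicit fork() and return its exit status."""
    # posix_spawn() takes the program path, the argument list, and an environment.
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

For example, `spawn_and_wait(["/bin/sh", "-c", "exit 7"])` runs a shell that exits with status 7 and hands that status back.
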
>
> My modern reaction to dealing with threads is...
>
> https://www.youtube.com/watch?v=hlVwbpm4eHI
>
> They're SOMETIMES the right tool for the job? Occasionally? Maybe?
>
> >> If you launch dozens of instances of the same program, the read only sections
> >> (text and rodata) are shared between all the instances. (This is why nommu
> >> systems needed to invent fdpic: in conventional ELF everything uses absolute
> >> addresses, which is fine when you've got an MMU because each process has its own
> >> virtual address range starting at zero. (Generally libc or something will mmap()
> >> about 64k of "cannot read, cannot write, cannot execute" memory there so any
> >> attempt to dereference a NULL pointer segfaults, but other than that...)
> >>
> >> But shared libraries need to move so they can fit around stuff. Back in the
> >> a.out days each shared library was also linked at an absolute address (just one
> >> well above zero, out of the way of most programs), meaning when putting together
> >> a system you needed a registry of what addresses were used by each library, and
> >> you'd have to supply an address range to each library you were building as part
> >> of the compiler options (or linker script or however that build did it). This
> >> sucked tremendously.
> >
> > (funnily enough, this gets reinvented as an optimization every couple
> > of decades. iirc macOS has "prelinking" again, but Android is
> > currently in the no-prelinking phase of the cycle.)
>
> The old line about how there are two hard problems in computer science: naming
> things, cache invalidation, and fencepost errors. This falls under "cache
> invalidation", which more generically is "object lifetime rules".
>
> The really FUN one is the horrible trick people did on various embedded systems
> for fast boot, or on OpenVZ as part of the live migration, where they'd
> basically core dump a process, load it into a debugger, and resume. Thus
> skipping all the setup! (Assuming NOTHING HAS CHANGED in the context the resumed
> process expects around it. Luckily X11 has "detach and restart" plumbing that
> lets it reopen a process's network pipe without killing the window or the
> process, because network connections hanging and needing retry isn't a new thing.)
>
> Sigh, I did a whole rant about what would be involved in kernel upgrades without
> reboots way back in 2002:
>
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html
> https://lkml.iu.edu/hypermail/linux/kernel/0206.2/1244.html
>
> And I was just going "this is _hard_" but people tracked me down from that and
> had me help IMPLEMENT some of that stuff over the years. The hard part was that
> processes act in GROUPS: parent/child relationships and pipelines and so on, and
> the kernel had no way to group processes. Enter "container" support, and me
> helping the parallels/OpenVZ guys explain _why_ the kernel could benefit from
> it. (The number of times I've been hired as a programmer and wound up spending
> most of my energy as a combination tech writer and marketer...)
>
> Sigh, I gotta go get on an airplane now, so stopping here for the moment...
>
> Rob