[Toybox] [musl] Not sure how to debug this one.

Sat Feb 17 09:02:06 PST 2024

On Fri, Feb 16, 2024 at 07:48:27PM -0600, Rob Landley wrote:
> While grinding away at release prep, I hit a WEIRD one. The qemu-system-sh4
> target got broken by commit 3e0e8c687eee (PID 1 exits trying to run the init
> script), which is the commit that changed the stdout buffering type.
> 
> It's not the kernel, if I use the last release kernel with the new root
> filesystem I see the problem, and newly built kernel from today's git with last
> release's initramfs.cpio.gz boots to a shell prompt.
> 
> The actual _problem_ is that sigsetjmp() is faulting (in sh.c function
> run_command()), for NO OBVIOUS REASON. Calling memset() to zero the struct
> before the sigsetjmp() works fine, but the sigsetjmp() call (built against
> musl-libc) never returns.
> 
> Not siglongjmp, _sigsetjmp_. Which means it's failing somewhere in:
> 
> https://git.musl-libc.org/cgit/musl/tree/src/signal/sh/sigsetjmp.s
> 
> And I dunno how to stick a printf into superh assembly code.

Rather than "stick a printf in there", can you identify (with gdb or
strace or qemu user execution tracing) exactly which instruction it's
crashing at, and the register values at the point of crash?

Provided it was called with a valid pointer to the sigjmp_buf, there
should be no way the initial call to sigsetjmp can segfault. The only
memory accesses it makes are to that object. It does make a call to
setjmp, which in theory could clobber the call-saved r8 containing
the sigjmp_buf address, but setjmp does not do that. It's possible that,
on second return, this has been clobbered; even a single-byte buffer
overflow into the sigjmp_buf would do that, and sh may be unique in
having the relevant register at the beginning of the buffer, which
could explain it happening only on sh. But that would affect the second
return, not the first.

> The sigjmp_buf lives on the stack, but I confirmed it's 8 byte aligned, and not
> even straddling a page boundary. I can access variables I stick before and after
> it, so it can't be some kind of "fault due to guard page" weirdness? (I suppose
> the optimizer may be invalidating that test, I could try adding "volatile"...)
> 
> While debugging I made the problem GO AWAY more than once by sticking printfs()
> and similar into the code, but that's not FIXING it. Adding another sigjmp_buf
> declaration and call to sigsetjmp() right at the start of the function works
> fine (although the other one in the place it's in now still fails). I confirmed

This all suggests that there's a buffer overflow and shuffling things
around on the stack is preventing it. Have you tried running (even on
unaffected archs) under valgrind to look for such errors?

Rich