[Toybox] [PATCH] Factor out dhcpd's and syslogd's main loops into a library function

Sun Dec 1 11:05:54 PST 2013

Taking the remaining holiday time to close old reply windows and such.  
I've ahd this one open a while... :)

On 09/06/2013 05:35:27 PM, Felix Janda wrote:
> Rob Landley wrote:
> > On 09/01/2013 08:54:27 AM, Felix Janda wrote:
> > > Both syslogd and dhcpd wait for sockets, signals or a timeout.
> > > Currently,
> > > they use select() with its timeout to listen to the sockets and a
> > > self-pipe for the signals.
> > >
> > > The new library function instead uses pselect() without a timeout  
> and
> > > alarm(). With pselect() signals are only unblocked during this  
> system
> > > call; so the xwrappers don't need to be aware about interrupted
> > > syscalls.
> >
> > Are we normally blocking a lot of signals in other commands? Using  
> the
> > 6 argument version instead of the 5 argument version addresses an  
> issue
> > for which existing command?
> 
> If a signal occurs during the block it will still be delivered in the  
> next
> pselect loop. pselect implements exactly the signal unblocking I am  
> talking
> about above. We block all signals. But during the signal call some are
> unblocked. When it returns they are again blocked and we can safely  
> check
> what has happened. See the article
> 
> https://lwn.net/Articles/176911/
> 
> for some discussion on the system call.

My problem with the approach was that the loop really shouldn't mess  
with signal handlers at all. If we want to fiddle with something fancy  
there's signalfd.

I have an aversion to "infrastructure in search of a user". When common  
code goes into the library, I like to be able to point to at least one  
existing user, preferably two.

> > > alarm() is used instead of the timeout argument to pselect() so  
> that
> > > the length of the intervals between timeouts is not affected by  
> being
> > > interrupted.
> >
> > QEMU recently changed its timer system to _not_ use alarm:
> >
> >    http://blog.alex.org.uk/2013/08/24/changes-to-qemus-timer-system/
> >
> > I'm not convinced sending signals to ourselves is preferable to  
> telling
> > the system call when to timeout. Of the three options (select  
> adjusting
> > its timespec, us manually reading the clock and feeding pselect a
> > timeout based on when we next want to wake up, and us sending  
> ourselves
> > an alarm), us sending ourselves an alarm and constantly having to  
> reset
> > it seems like it has the most moving parts and the most things that  
> can
> > go wrong.
> 
> select() adjusts its timespecs on linux. But that's only one possible
> behavior specified by POSIX.

Some not-linuxes don't work like linux, posix is vague enough to allow  
Windows NT to be certified...

I'm pretty happy sticking with a documented Linux semantic if it's a  
documented Linux semantic.

> Since we are already doing signals SIGALRM
> seemed more elegant then constantly adjusting the timeout to me.

Having a library I/O function modify global signal handler state is  
elegant?

> > Ok, maybe I've been slightly burned by SIGALRM before:
> >
> >    http://landley.net/notes-2011.html#05-09-2011
> >
> > > On SIGHUP syslogd restarts. Previously everything relevant was
> > > contained
> > > in syslogd_main() and the restarting was implemented by a goto.  
> Now a
> > > longjmp() would be necessary. I instead used an xexec().
> >
> > Since commit 955 (and pondering toysh to finally tackle that sucker
> > within a finite amount of time) I'm a bit worried that restarting a
> > command without OS involvement might misbehave if the signal mask  
> gets
> > corrupted, or there are funky environment variables, or memory leaks
> > build up, or we have leftover filehandles, or a couple dozen other
> > things.
> >
> > We haven't got "generic reset this process's state" code, longjmp()
> > leaves a lot of debris, and xexec() callign a new main doesn't clear
> > the old stuff off the stack; do that in a tight enough loop for long
> > enough and eventually you will run out of stack even in a normal  
> linux
> > userspace. (Default's 8 megs: ulimit -s says 8192 and the bash man  
> page
> > says the units are kilobytes.)
> >
> > I added NOFORK because some commands (like cd) _must_ happen in the
> > current toysh process context, but also because I could very very
> > carefully audit commands to make sure it was safe to call them in  
> the
> > same process context and continue when they return. But these  
> commands
> > are in pending because I haven't even done an initial audit pass on
> > them.
> >
> > Discarding your process on restart and execing a new one covers a
> > multitude of sins. Possibly syslogd is clean enough it can restart
> > itself without accumulating trash, but as a long-lived daemon I have
> > yet to closely read (hence it being in the pending directory), this
> > does not fill me with confidence.
> >
> > Probably just the "it's 5am, sun coming up soon" things. I should  
> step
> > away from the keyboard...
> 
> You are of course right. I think you have some good ideas below how to
> fix it.

I note that "fork but don't exec" is remarkably cheap, and toybox  
should probably do more of that. The downside is long chains of xexec()  
that call command_main() can potentially eat a lot of stack and heap  
and filehandles and such.

Some vague infrastructure I haven't had time to do would say "what are  
the resources used by this process", and look at memory allocations not  
yet freed, open filehandles, callstack, memory mappings, and so on. So  
far the above is just a malloc/free wrapper, checking in /proc, and  
backtrace(). Possibly some code under CONFIG_DEBUG that dumps leaked  
filehandles and such. I could clean up the stack with a longjmp()...

There are also some bad habits I need to go through the code and clean  
up. For example, if you call xexec() it'll try to free toys.optargs if  
it's not pointing to the original argv[], and if we've incremented  
optargs (which we do in a few places looping through without allocating  
a temp variable), that'll break the heap. These are commands that don't  
currently xexec(), but I have a commit pending to clean those out  
anyway because I don't want to set a bad example that gets copied.

I could also just add an xexec() counter to struct toy_context and  
force an actual exec /proc/self/exe if it hits 5 or something.  
(Decrement it in the exit path, it's not sequential calls it's  
_stacked_ calls.) But adding infrastructure is a poor substitute for  
writing code to not do that in the first place...

Which brigns us back to my objection to the signal handling. :)

> > > Since xexec()
> > > doesn't change the pid the pidfile needs to be unlinked() before
> > > terminating so that xpidfile() doesn't complain.
> >
> > Um... I agree this is a symptom of a problem. I'm not sure that's  
> the
> > right fix. The restart is maintaining state and then getting  
> confused
> > by the state it's maintaining...
> >
> > (I can just see some horrible systemd thing hitting a race window  
> where
> > the PID file wasn't there and trying to restart the process...)
> 
> Does systemd still need PID files?

I have no idea. It's a horrible thing and I'm trying to avoid it. My  
point was not assuming behavior on the part of other programs, and  
leaving a window where something doing inotify() wakes up because an  
indicator file went away when the process it represents didn't and  
isn't going to... I try to avoid that sort of halfway state where there  
can be conflicting assumptions.

(Of course if the child does an actual exec() you still have the  
recycled PID issue. This is why I prefer to "I have a real program and  
it's doing X, this causes Y" rather than trying to solve theoretical  
future problems.)

> > > Furthermore if not run
> > > in the foreground syslogd will daemonize() again when restarting.
> > > Could
> > > one detect in daemonize() whether daemonizing is necessary?
> >
> > "The restart is maintaining state and then getting..."
> >
> > Are these the _only_ instances of this? Is a main loop that exits  
> and
> > restarts itself but avoids initial setup better?
> >
> > Is there a clean way to do this where we _don't_ have to be  
> absolutely
> > sure we've thought of everything ever? execv("/proc/self/exe",
> > toys.argc) and then fall back to something else if it returns,  
> perhaps?
> > Or can we get an exhaustive list of all process resources (from the
> > container suspend/resume guys, presumably) and check for ourselves  
> that
> > we're not leaking anything?
> 
> So how about making the two helper functions for daemon_main_loop()
> return a value and possibly returning from daemon_main_loop() based on
> that value? Then in syslogd_main() add a infinite loop around (or  
> almost)
> everything.

I need to go clean up dhcpd and systemd before figuring out what the  
infrastructure should look like. If we're going to do anything fancy  
with signals I lean towards teaching this loop about signalfd, but  
probably since Linux can adjust the timeout we should just let Linux  
adjust the timeout.

> > Very old concern of mine, a full decade before containers:
> >
> >    http://lkml.indiana.edu/hypermail/linux/kernel/0206.2/0835.html
> >
> > > Felix
> > >
> > > # HG changeset patch
> > > # User Felix Janda <felix.janda at posteo.de>
> > > # Date 1378041908 -7200
> > > # Node ID 752d22ece0fe3fa9763fd0b6e231e653cc173890
> > > # Parent  587b7572aac2a64988503458155e1f2546672525
> > > Factor out daemon select() loops into a library function
> >
> > Ok, the new select() lib function I'd probably be up for reviewing  
> on
> > its own, and the cleanups to a function I haven't even read yet I'd
> > just pass through and read/cleanup the new baseline when I got  
> around
> > to it.
> 
> This patch really just implements the helper function and makes the  
> two
> toys use it. It's just a big function, causes the big main functions  
> to
> be split up and makes the self-pipe stuff unnecessary.

$ hg status toys
M toys/lsb/umount.c
M toys/other/hello.c
M toys/other/pmap.c
M toys/other/pwdx.c
M toys/other/vconfig.c
M toys/other/vmstat.c
M toys/pending/dumpleases.c
M toys/pending/su.c
M toys/posix/du.c
M toys/posix/grep.c

Lemme finish my current set of cleanups, then get back to this. :)

> > But both at once? Not up for that right now. Might try again in the
> > morning...
> 
> I wouldn't consider syslogd to be of high priority.
> 
> On the other hand I'd like (or maybe would have liked some days ago)
> to continue with cleaning it up. Now that the main function is not  
> such
> a monster anymore one can maybe inline more stuff. OTOH I'm still not
> sure about the behavior when something goes wrong when the toy starts.
> 
> If we come to an conclusion here, I'll send you a new version of the
> patch, right?

I'm finally sort of resurfacing from A) catching up on domestic life  
after 6 months out of state, B) starting a new job after the  
kickstarter idea got zero replies.

But as you can see: backlog on all fronts. I haven't forgotten about  
this, I'm just... behind. :)

Thanks for your patience,

Rob