[Toybox] ps crashes
Rob Landley
rob at landley.net
Sun Mar 12 01:04:35 PST 2017
On 03/11/2017 01:22 PM, enh wrote:
> from the line number addr2line gave above, it's the line marked with !
> below. i've included the block between too, since that modifies
> new->child.
>
> // Recurse down into tasks, retaining thread groups.
> // Disable show_process at least until we can calculate tcount
> kcount = TT.kcount;
> sprintf(toybuf, "/proc/%u/task", pid);
> new->child = dirtree_flagread(toybuf, DIRTREE_SHUTUP|DIRTREE_PROC, get_ps);
> TT.threadparent = 0;
> kcount = TT.kcount-kcount+1;
> tb = (void *)new->extra;
> tb->slot[SLOT_tcount] = kcount;
>
> // Fill out tid and thread count for each entry in group
> ! if (new->child) for (dt = new->child->child; dt; dt = dt->next) {
>
> so i assume it's that dirtree_flagread that's failing, returning
> DIRTREE_ABORTVAL, and since the code that follows just assumes it's
> either valid or null, we die on the line marked ! (as claimed by the
> stack trace).
>
> i don't know is why why dirtree_flagread is failing, though, and can't
> reproduce it.
Oh that's easy to see from context: /proc/%u exited before /proc/$u/task
could be opened, so dirtree_flagread() went "nope" and we didn't check
for it.
Asynchronous process exit while we're reading its data is something I
tried to be robust against in the rest of ps, usually by effectively
moving the process exit a milisecond or so earlier. If we couldn't read
the data because process go bye-bye how does that differ from process go
bye-bye an instant before we would have noticed it?
Looks like I messed up here going back and adding thread support because
the dirtree_flagread() semantics are non-obvious. (The
dirtree_flagread() reopen in thread scanning is a rough edge in the
design. I'm splicing together discontiguous trees manually and the
result doesn't have the checks and review the existing dirtree.c stuff
does.)
And that's my real concern here: the dirtree_flagread() semantics
_shouldn't_ be non-obvious. Should I change dirtree_flagread() to always
return NULL or valid? Why did I have it NOT do that in the first place
(to the point I commented it not doing that). I vaguely recall there's
some error condition that the caller can only distinguish if it gets the
extra error return to distinguish the case, but I don't remember _what_
the condition is. (Exists but permissions wrong to actually read its
contents, maybe?)
Alas my toybox work directory's in a bit of a shambles at the moment
because I've got half-finished make.sh changes for getconf.c and
half-finished help text parsing changes, and between the two of 'em it
doesn't _build_ at the moment. (My two weeks in tokyo and the week in
portland before that I was too busy with other stuff to close any tabs.
Happy to be home, but torn between resting and shoveling out...)
I can do a quick fix here (just add a test for abortval on that line)
but I'd rather change flagread not to return abortval, which means
understanding what I lose by doing that and being comfortable with it.
(And why the _other_ calls to flagread aren't dying more regularly with
abortval returns? How is this NOT screwing up ps with no /proc ?)
But once again, I'm only getting to this at 4am. (Ok, mostly that's
jetlag royally horking my schedule.) Might not get to this tomorrow if
getconf build infrastructure continues to be stroppy, but it's #3 on my
toybox todo list right now.
Thanks,
Rob
More information about the Toybox
mailing list