[Toybox] top overhead
enh
enh at google.com
Mon Nov 18 07:37:40 PST 2024
On Sun, Nov 17, 2024 at 2:14 AM Rob Landley <rob at landley.net> wrote:
> On 11/16/24 12:34, enh wrote:
> > toybox top's relatively high overhead bugged me again, and a couple of
> > things stood out from `strace -c`.
> >
> > one is that we call sysinfo() once for each process. this doesn't take a
> > _huge_ amount of cpu time, but it's still noticeable. procps and busybox
> > don't do this, and ironically the code comment says this is _for_ slow
> > systems:
> >
> > // /proc data is generated as it's read, so for maximum accuracy on
> slow
> > // systems (or ps | more) we re-fetch uptime as we fetch each /proc
> line.
> > sysinfo(&si);
> > slot[SLOT_uptime] = si.uptime;
> > slot[SLOT_totalram] = si.totalram;
> > slot[SLOT_upticks] = slot[SLOT_uptime]*TT.ticks -
> slot[SLOT_starttime];
> >
> > unfortunately that comment was there since the sysinfo() call was first
> > introduced, so i'm not sure what the motivating example was there? (and
> > even if it did matter, would it make more sense to just compute all the
> > "upticks" slots at the end? uptime and totalram seem like it's _less_
> > confusing if they're just measured once per snapshot.)
>
> It would have to be cached before we start reading because ps does
> progressive output (it reads a /proc entry and prints it to stdout
> before reading the next one, the sorting comes from /proc returning PID
> entries in numerical order, ala ls -1f /proc).
>
ah, i'm only ever looking at top (because that's where the excessive cpu
usage is obvious). i didn't realize /proc was sorted, even though that's
obviously the easy way to do it, because ls defaults to messing that up :-)
> I was thinking if you "ps | more" you can have quite a gap, but if the
> overhead is significant move it to the front to do once and call pausing
> the output pilot error that may give inconsistent results upticks.
>
yeah, that makes sense to me.
> We're also doing a stat() of each /proc entry to get uid and gid of the
> process (well, dirtree does but we let it), I didn't think this was more
> expensive than that? (Alas, getdents returns filetype but doesn't return
> uid/gid.)
>
the other ps'es do this too, so i was ignoring that.
> > -*-
> >
> > separately, we're doing a lot more read()s, apparently because we always
> > read /proc/pid/status in addition to stat and statm. a bit of printf
> > debugging shows that's because _PS_RUSER is set ... but i'm not sure why?
>
> Sigh, I _did_ try to use those bits to gate stuff so it didn't do more
> work than necessary to populate fields we weren't going to use. :P
>
> Let's see, it's checking TT.bits which is populated by get_headers()
> which when running "top" with no arguments is doing:
>
> PID=1
> USER=20000000
> PR=800
> NI=8
> VIRT=80000000000
> RES=100000000000
> SHR=200000000000
> S=80000000000000
> %CPU=40000000000
> %MEM=20000000000
> TIME+=4000000000
> ARGS=4000000
> 803e4024000809
>
> And PS_RUSER is 31 ala _PS_RUSER is 80000000 which is NOT set in that
> (the relevant digit is 2).
>
> The only other thing potentially assigning to those bits is
> if (FLAG(f)) TT.bits |= _PS_CMDLINE;
>
> > we're not actually showing that in the default set of fields? i tried to
> > work out where that was coming from, but it's time to get some fresh air
> > and i haven't made much progress, so i wanted to at least record/report
> > that, since this is further than i'd poked before. hacking this out is a
> > bigger improvement than hacking sysinfo() out.
>
> ... yay?
>
> It doesn't appear to be doing it here and it's still slow for me, I've
> meant to track down why and try to speed it up but I didn't think it was
> necessarily on the ps side? Debian's ps is slower:
>
> $ time bash -c 'for ((i=0; i<100; i++)); do ps -A >/dev/null; done'
> real 0m4.632s
> user 0m1.649s
> sys 0m2.926s
> $ time bash -c 'for ((i=0; i<100; i++)); do toybox ps -A >/dev/null; done'
> real 0m2.874s
> user 0m1.220s
> sys 0m1.638s
>
> Seemed most likely to be the crunch_string() stuff being terrible...
>
> > -*-
> >
> > even more confusing, i'm struggling to correlate what i'm seeing in
> strace
> > with what i expect from the code. given that ps.c is using readfile() i'd
> > expect an openat(), an fstatat() for the size, and one read(). but i
> > actually see the openat(), a short read() into a larger buffer, then a
> > read() that returns 0. oh, because the loop in readfd() is calling
> > readall(), not read(), and readall() doesn't have the same "short read =>
> > EOF" that readfd() appears to have [but doesn't actually have, because
> it's
> > calling readall() which won't return short reads except on error].
>
> We dunno how big the data is, stat on proc returns 0 for all the file
> lengths. (Understandably, but still.)
>
yeah, i'm assuming that the others know that this stuff is restricted to
fitting in a page (in most cases; maps is the main exception i know of; and
actually significantly less than a page in most other cases), so if your
supplied buffer is >= than the largest possible output...
> > this caution probably isn't necessary for procfs? (procps isn't this
> > cautious, but busybox looks a lot like toybox under strace, presumably
> also
> > calling a generic routine rather than taking advantage of any assumptions
> > about procfs.)
>
> Hmmm, I suppose that might make /proc calculate the contents twice.
> There's quite a bit of plumbing in lib/lib.c for this, readfd() can do
> an fdlength() which is silly here...
>
> Every user in ps is supplying a buffer with max length, I can do a local
> procread() that does a one shot read if you think it would help?
>
i have no particular reason to believe that (and in fact quite a lot of
reasons to believe it doesn't matter). my obsession with "why so many
read()s?" was really because every time i profile this code, it's always
the scanf()-y stuff that's taking the time.
this is also one reason i didn't mention that we open "cmdline" for every
process and the others don't --- 1. iirc that's a feature [we have
non-truncated output by default; though presumably we could pre-measure the
column width and not bother with this if we know our window is too narrow
to show bytes that aren't in /proc/pid/stat anyway; which i think is
actually the case with the default top fields in an 80-column tty?] but
more importantly 2. we don't parse bits out of cmdline, so it's way cheaper
than the others that need a scanf().
> > -*-
> >
> > but it looks like just fixing whatever's setting _PS_RUSER even though we
> > don't need it gets us a lot closer to busybox/procps top load. definitely
> > time for that fresh air though...
>
> I don't think I have your test environment here...
>
> I commented out start_redraw() so we don't go into raw mode and stuck an
> exit(1) on line 795 (right before the sprintf("%lld/status") and it
> triggered, but... Hmmm.
>
> if ((TT.bits&(_PS_RGROUP|_PS_RUSER|_PS_STAT|_PS_RUID|_PS_RGID|_PS_SWAP
> |_PS_IO|_PS_DIO)) || TT.GG.len || TT.UU.len)
>
> TT.GG.len is 0, TT.UU.len is 0, TT.bits and all that mask stuff is
> 803e4000000000 which is strange because none of the mask bits actually
> intersect? What is...
>
> dprintf(2, "mask=%llx\n",
> _PS_RGROUP|_PS_RUSER|_PS_STAT|_PS_RUID|_PS_RGID|_PS_SWAP|_PS_IO|_PS_DIO);
>
> mask=ffffffffc0000000
>
> What? These all have parentheses, right? Yes:
>
> generated/tags.h:#define _PS_RGROUP (1LL<<35)
>
> AHA! Found it:
>
> #define _PS_RUSER (1<<31)
>
> As a signed 32 bit int that's a negative number, which when type
> promoted to a signed long in the big or chain sets the top 32 bits in
> the mask.
>
> Ok, I need to to switch to 1LL one entry earlier in scripts/make.sh...
>
(/me wonders whether there are any remaining places where we have this bug,
since it seems to be a popular one!)
> Commit a51c66edb7c9. (The performance is still terrible, but that's one
> issue down...)
>
> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20241118/0954e372/attachment-0001.htm>
More information about the Toybox
mailing list