<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Nov 17, 2024 at 2:14 AM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 11/16/24 12:34, enh wrote:<br>

> toybox top's relatively high overhead bugged me again, and a couple of<br>

> things stood out from `strace -c`.<br>

> <br>

> one is that we call sysinfo() once for each process. this doesn't take a<br>

> _huge_ amount of cpu time, but it's still noticeable. procps and busybox<br>

> don't do this, and ironically the code comment says this is _for_ slow<br>

> systems:<br>

> <br>

>    // /proc data is generated as it's read, so for maximum accuracy on slow<br>

>    // systems (or ps | more) we re-fetch uptime as we fetch each /proc line.<br>

>    sysinfo(&si);<br>

>    slot[SLOT_uptime] = si.uptime;<br>

>    slot[SLOT_totalram] = si.totalram;<br>

>    slot[SLOT_upticks] = slot[SLOT_uptime]*TT.ticks - slot[SLOT_starttime];<br>

> <br>

> unfortunately that comment was there since the sysinfo() call was first<br>

> introduced, so i'm not sure what the motivating example was there? (and<br>

> even if it did matter, would it make more sense to just compute all the<br>

> "upticks" slots at the end? uptime and totalram seem like it's _less_<br>

> confusing if they're just measured once per snapshot.)<br>

<br>

It would have to be cached before we start reading because ps does <br>

progressive output (it reads a /proc entry and prints it to stdout <br>

before reading the next one, the sorting comes from /proc returning PID <br>

entries in numerical order, ala ls -1f /proc).<br></blockquote><div><br></div><div>ah, i'm only ever looking at top (because that's where the excessive cpu usage is obvious). i didn't realize /proc was sorted, even though that's obviously the easy way to do it, because ls defaults to messing that up :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I was thinking if you "ps | more" you can have quite a gap, but if the <br>

overhead is significant move it to the front to do once and call pausing <br>

the output pilot error that may give inconsistent results upticks.<br></blockquote><div><br></div><div>yeah, that makes sense to me.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

We're also doing a stat() of each /proc entry to get uid and gid of the <br>

process (well, dirtree does but we let it), I didn't think this was more <br>

expensive than that? (Alas, getdents returns filetype but doesn't return <br>

uid/gid.)<br></blockquote><div><br></div><div>the other ps'es do this too, so i was ignoring that.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> -*-<br>

> <br>

> separately, we're doing a lot more read()s, apparently because we always<br>

> read /proc/pid/status in addition to stat and statm. a bit of printf<br>

> debugging shows that's because _PS_RUSER is set ... but i'm not sure why?<br>

<br>

Sigh, I _did_ try to use those bits to gate stuff so it didn't do more <br>

work than necessary to populate fields we weren't going to use. :P<br>

<br>

Let's see, it's checking TT.bits which is populated by get_headers() <br>

which when running "top" with no arguments is doing:<br>

<br>

PID=1<br>

USER=20000000<br>

PR=800<br>

NI=8<br>

VIRT=80000000000<br>

RES=100000000000<br>

SHR=200000000000<br>

S=80000000000000<br>

%CPU=40000000000<br>

%MEM=20000000000<br>

TIME+=4000000000<br>

ARGS=4000000<br>

803e4024000809<br>

<br>

And PS_RUSER is 31 ala _PS_RUSER is 80000000 which is NOT set in that <br>

(the relevant digit is 2).<br>

<br>

The only other thing potentially assigning to those bits is<br>

   if (FLAG(f)) TT.bits |= _PS_CMDLINE;<br>

<br>

> we're not actually showing that in the default set of fields? i tried to<br>

> work out where that was coming from, but it's time to get some fresh air<br>

> and i haven't made much progress, so i wanted to at least record/report<br>

> that, since this is further than i'd poked before. hacking this out is a<br>

> bigger improvement than hacking sysinfo() out.<br>

<br>

... yay?<br>

<br>

It doesn't appear to be doing it here and it's still slow for me, I've <br>

meant to track down why and try to speed it up but I didn't think it was <br>

necessarily on the ps side? Debian's ps is slower:<br>

<br>

$ time bash -c 'for ((i=0; i<100; i++)); do ps -A >/dev/null; done'<br>

real    0m4.632s<br>

user    0m1.649s<br>

sys     0m2.926s<br>

$ time bash -c 'for ((i=0; i<100; i++)); do toybox ps -A >/dev/null; done'<br>

real    0m2.874s<br>

user    0m1.220s<br>

sys     0m1.638s<br>

<br>

Seemed most likely to be the crunch_string() stuff being terrible...<br>

<br>

> -*-<br>

> <br>

> even more confusing, i'm struggling to correlate what i'm seeing in strace<br>

> with what i expect from the code. given that ps.c is using readfile() i'd<br>

> expect an openat(), an fstatat() for the size, and one read(). but i<br>

> actually see the openat(), a short read() into a larger buffer, then a<br>

> read() that returns 0. oh, because the loop in readfd() is calling<br>

> readall(), not read(), and readall() doesn't have the same "short read =><br>

> EOF" that readfd() appears to have [but doesn't actually have, because it's<br>

> calling readall() which won't return short reads except on error].<br>

<br>

We dunno how big the data is, stat on proc returns 0 for all the file <br>

lengths. (Understandably, but still.)<br></blockquote><div><br></div><div>yeah, i'm assuming that the others know that this stuff is restricted to fitting in a page (in most cases; maps is the main exception i know of; and actually significantly less than a page in most other cases), so if your supplied buffer is >= than the largest possible output...</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> this caution probably isn't necessary for procfs? (procps isn't this<br>

> cautious, but busybox looks a lot like toybox under strace, presumably also<br>

> calling a generic routine rather than taking advantage of any assumptions<br>

> about procfs.)<br>

<br>

Hmmm, I suppose that might make /proc calculate the contents twice. <br>

There's quite a bit of plumbing in lib/lib.c for this, readfd() can do <br>

an fdlength() which is silly here...<br>

<br>

Every user in ps is supplying a buffer with max length, I can do a local <br>

procread() that does a one shot read if you think it would help?<br></blockquote><div><br></div><div>i have no particular reason to believe that (and in fact quite a lot of reasons to believe it doesn't matter). my obsession with "why so many read()s?" was really because every time i profile this code, it's always the scanf()-y stuff that's taking the time.</div><div><br></div><div>this is also one reason i didn't mention that we open "cmdline" for every process and the others don't --- 1. iirc that's a feature [we have non-truncated output by default; though presumably we could pre-measure the column width and not bother with this if we know our window is too narrow to show bytes that aren't in /proc/pid/stat anyway; which i think is actually the case with the default top fields in an 80-column tty?] but more importantly 2. we don't parse bits out of cmdline, so it's way cheaper than the others that need a scanf().</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

> -*-<br>

> <br>

> but it looks like just fixing whatever's setting _PS_RUSER even though we<br>

> don't need it gets us a lot closer to busybox/procps top load. definitely<br>

> time for that fresh air though...<br>

<br>

I don't think I have your test environment here...<br>

<br>

I commented out start_redraw() so we don't go into raw mode and stuck an <br>

exit(1) on line 795 (right before the sprintf("%lld/status") and it <br>

triggered, but... Hmmm.<br>

<br>

   if ((TT.bits&(_PS_RGROUP|_PS_RUSER|_PS_STAT|_PS_RUID|_PS_RGID|_PS_SWAP<br>

                |_PS_IO|_PS_DIO)) || TT.GG.len || TT.UU.len)<br>

<br>

TT.GG.len is 0, TT.UU.len is 0, TT.bits and all that mask stuff is <br>

803e4000000000 which is strange because none of the mask bits actually <br>

intersect? What is...<br>

<br>

dprintf(2, "mask=%llx\n", <br>

_PS_RGROUP|_PS_RUSER|_PS_STAT|_PS_RUID|_PS_RGID|_PS_SWAP|_PS_IO|_PS_DIO);<br>

<br>

mask=ffffffffc0000000<br>

<br>

What? These all have parentheses, right? Yes:<br>

<br>

generated/tags.h:#define _PS_RGROUP                       (1LL<<35)<br>

<br>

AHA! Found it:<br>

<br>

#define _PS_RUSER                        (1<<31)<br>

<br>

As a signed 32 bit int that's a negative number, which when type <br>

promoted to a signed long in the big or chain sets the top 32 bits in <br>

the mask.<br>

<br>

Ok, I need to to switch to 1LL one entry earlier in scripts/make.sh...<br></blockquote><div><br></div><div>(/me wonders whether there are any remaining places where we have this bug, since it seems to be a popular one!)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Commit a51c66edb7c9. (The performance is still terrible, but that's one <br>

issue down...)<br>

<br>

Rob<br>

</blockquote></div></div>