[Toybox] ps -T

Wed Nov 1 09:05:30 PDT 2017

On Wed, Nov 1, 2017 at 1:57 AM, Rob Landley <rob at landley.net> wrote:
> On 10/19/2017 06:13 PM, enh wrote:
>>> On 09/20/2017 05:08 PM, enh wrote:
>>>> ps -T doesn't really work if you have any filters. so ps -AT is fine,
>>>> but ps -p <chrome pid> -T only shows the main thread.
>>>
>>> Alas, I don't personally use threads much so basically never test this.
>>>
>>>> why? because slots[SLOT_pid] is "wrong" in shared_match_process (where
>>>> by wrong i mean "is the tid").
>>>>
>>>> why? because toybox reads (say) /proc/147047/task/147058/stat and sees
>>>>
>>>> 147058 (CompositorTileW) S 31782 6249 6249 0 -1 1077952576 4 0 0 0 0 0
>>>> 0 0 20 0 11 0 1211910244 928649216 35602 18446744073709551615
>>>> 94558515900416 94558627572512 140720560858928 140506510343072
>>>> 140506833892356 0 0 4098 1073827581 1 0 0 -1 31 0 0 0 0 0
>>>> 94558627579744 94558633602584 94558666661888 140720560866826
>>>> 140720560866928 140720560866928 140720560869342 0
>>>>
>>>> and copies 147058 into SLOT_pid because that code no longer knows the real pid.
>
> I added -H to iotop, fixed the off-by-one error in screen width
> truncation, set the screen width to 72, and ran "top -H -O TID,SHR" and
> cursored over to the SHR column and:
>
>   PID USER           TID [SHR]%CPU  %MEM     TIME+ THREAD          PROCE
>  1865 landley       1865  84M 98.1  14.8 128:47.94 thunderbird     thun+
>  1865 landley       1895  84M  1.8  14.8  41:53.81 SoftwareVsyncTh thun+
>  1865 landley        723  84M  0.0  14.8   0:00.00 StreamT~s #3048 thun+
>  1865 landley      30482  84M  0.0  14.8   2:14.82 DOM Worker      thun+
>  1865 landley       9488  84M  0.0  14.8   7:16.20 DOM Worker      thun+
>  1865 landley      16082  84M  0.0  14.8  15:33.85 DOM Worker      thun+
>  1865 landley      15523  84M  0.0  14.8   0:00.49 DOM Worker      thun+
>  1865 landley      12086  84M  0.0  14.8  17:24.69 DOM Worker      thun+
>  1865 landley       3838  84M  0.0  14.8  18:14.68 DOM Worker      thun+
>  1865 landley       2254  84M  0.0  14.8  18:26.28 DOM Worker      thun+
>  1865 landley      30746  84M  0.0  14.8  18:30.12 DOM Worker      thun+
>
> And they all seem to think they're PID 1865, but each knows its TID?

yep, that works fine.

> $ ls /proc/1865/task/
> 10228  1865  1875  1880  1887  1893  1899  1911  1934   30482  5471
> 12086  1871  1876  1881  1888  1894  1903  1912  19926  30746  6316
> 14369  1872  1877  1882  1889  1895  1904  1921  22017  32148  9488
> 15523  1873  1878  1885  1891  1897  1906  1927  2254   3838   9955
> 16082  1874  1879  1886  1892  1898  1908  1933  2770   5469
>
> Which seems reasonable? It's got PID, it's got TID, what do I need to
> fix here?

the bug i reported... :-)

i think you read the bug backwards because i included a case that *does* work...

> (Aside: thunderbird really, really, really doesn't like a local
> linux-kernel folder with 500k messages in it. Or a pop3 inbox going back
> to 2013, which is the last time I split it. But I think what it's mad
> about right now is the BUG() I hit in the vfat code which I had to fsck
> away to do some work, and thus the vfat maintainer couldn't reproduce
> it. I've meant to reboot ever since, it happened about 5 times before I
> fixed it, emergency-zapping a filesystem each time, and the memory
> management on this box has gone all wonky since then. I've been meaning
> to reboot to replace the keyboard anyway, but 8 desktops full of windows
> full of tabs takes a while to unwind...)
>
>>>> not sure how best to fix this.
>>>
>>> Hmmm... Reasonably straightforward to fix,
>
> Not necessarily straightforward to reproduce.
>
> $ ps -AT
> PID   TID
> 32667 32667 ?        00:26:22 chromium-browse
> 32667 32668 ?        00:00:00 TaskSchedulerSe
> 32667 32669 ?        00:00:20 Chrome_ChildIOT
> 32667 32670 ?        00:00:00 GpuMemoryThread

...specifically the ps -AT case...

...and a case that *doesn't* work, which is when there's a filter
*instead of* -A. i'll just repeat the original because i'm too lazy to
rewrite it and because i think the problem is just that you were
too tired when you looked at it :-)

ps -T doesn't really work if you have any filters. so ps -AT is fine,
but ps -p <chrome pid> -T only shows the main thread.

why? because slots[SLOT_pid] is "wrong" in shared_match_process (where
by wrong i mean "is the tid").

why? because toybox reads (say) /proc/147047/task/147058/stat and sees

147058 (CompositorTileW) S 31782 6249 6249 0 -1 1077952576 4 0 0 0 0 0
0 0 20 0 11 0 1211910244 928649216 35602 18446744073709551615
94558515900416 94558627572512 140720560858928 140506510343072
140506833892356 0 0 4098 1073827581 1 0 0 -1 31 0 0 0 0 0
94558627579744 94558633602584 94558666661888 140720560866826
140720560866928 140720560866928 140720560869342 0

and copies 147058 into SLOT_pid because that code no longer knows the real pid.

i'll also add a concrete example from my laptop right now:

/tmp/toybox$ ps -T 32190
  PID  SPID TTY      STAT   TIME COMMAND
32190 32190 ?        Sl     0:00 /usr/bin/uplink-soecks
32190 32192 ?        Sl     0:01 /usr/bin/uplink-soecks
32190 32193 ?        Sl     0:00 /usr/bin/uplink-soecks
32190 32194 ?        Sl     0:02 /usr/bin/uplink-soecks
32190 32197 ?        Sl     0:02 /usr/bin/uplink-soecks
32190 32204 ?        Sl     0:02 /usr/bin/uplink-soecks
32190 32205 ?        Sl     0:00 /usr/bin/uplink-soecks
32190 32206 ?        Sl     0:02 /usr/bin/uplink-soecks
32190 32207 ?        Sl     0:00 /usr/bin/uplink-soecks
32190 21914 ?        Sl     0:02 /usr/bin/uplink-soecks
32190 21915 ?        Sl     0:01 /usr/bin/uplink-soecks
32190 31176 ?        Sl     0:01 /usr/bin/uplink-soecks
/tmp/toybox$ ./toybox ps -T 32190
  PID   TID TTY          TIME CMD
32190 32190 ?        00:00:16 uplink-soecks
/tmp/toybox$

> They're different?
>
> $ ls /proc/32667/task/
> 10087  32668  32670  32672  32674  5747  5754
> 32667  32669  32671  32673  32697  5748
>
> And they're reasonable?
>
>>> but my tree has local
>>> changes in ps.c. Looks like I'm adding -m to show maximum number of
>>> lines (somebody asked, it's easy enough.)
>>
>> i said "no" to the single internal request we had for that when we
>> switched from traditional Android top to toybox top. easy, yes, but
>> not obviously useful. the original Android top only had batch mode, so
>> it was a bit more useful then. but "first N" isn't an obviously
>> meaningful heuristic. "field X no lower than Y" would be more
>> convincing. but that's no longer as easy :-)
>
> I can yank it again. If you wanna design a filter syntax I can probably
> implement something.

no, i don't care either way. i just wasn't going to implement
something that i personally believe isn't useful (or at least "doesn't
answer the question they actually have, and is prone to giving you too
much or too little information"). its two-year absence has weaned
Google's Android folks off it, so i'll probably never be annoyed by a
useless bug report caused by truncation again :-)

> I've just started poking at bc a la shell $((blah)) math syntax, and I'm
> likely to make a function or two that lets you substitute in variables
> (via string substitution) and do math on the result.
>
> That said... that's not the syntax we've got in find, or in test. And ps
> has:
>
>   -o  FIELDs instead of defaults, each with optional :size and =title
>
> I could add <XXX and >XXX to that, I suppose? Might take a bit of
> fiddling to make room.
>
> That said, you still couldn't implement -m with a syntax like that.
> Maximum number of fields to display isn't a -o field. :)

yeah, but like i said: it's just not a useful thing anyway. if you
genuinely just care about lines of text, use `head`. if you're using
it as a proxy for something else, this feature would let you actually
cut off at the right point. (so personally i'd have said "no -m, but
generalized filtering is on the TODO list, waiting for enough people
to have concrete use cases".)

>>> And -H to iotop (which is
>>> where I left off; need to come up with a test for this and haven't got
>>> one. Is chrome threads or processes? The big Scott McCloud comic implied
>>> processes, but the way google does everything implies threads, but
>>> threads would defeat the entire VM sandboxing purpose of having each tab
>>> in its own process...)
>>
>> if you run (GNU) "ps -AT" while chrome's running, you'll see it's a
>> mix. i think even the original design assumed that. see the first
>> diagram here, where solid boxes are processes and dashed boxes are
>> threads: https://www.chromium.org/developers/design-documents/multi-process-architecture
>
> I read Scott McCloud's comic way back when, and am currently on a plane
> to tokyo and would have to pay for network access (it's not the money,
> it's the visceral dislike of entering credit card information into a web
> page; DO NOT TRUST), but I'll try to remember to take a look.
>
> Meanwhile, I believe you. :)
>
>> here's far more detail than you could possibly want (including the
>> command-line options that let you configure the model):
>> https://www.chromium.org/developers/design-documents/process-models
>
> I'm happy that as of a couple months back:
>
>   ps ax | grep renderer | awk '{print $1}' | xargs kill

i actually slightly miss that it no longer leaks itself to death every
couple of weeks --- now i'm sometimes forced to manually restart it
when the "time to upgrade" blob has been red too long.

> works again. I haven't asked much deeper than that. (I remember the
> videos of high speed lightning discharge vs chrome page rendering. My
> experience on this netbook is more "under 5 seconds is pretty good,
> under 15 is usually tolerable", but the stop button on chrome is UTTERLY
> USELESS (doesn't even stop the notification crawl at the bottom from
> TELLING you all the things it's loading, often using the monthly 4 gig
> data cap t-mobile applies to tethering but not to what the actual phone
> uses because money; I'm checking to see I've got the right youtube link
> before tweeting it, I _DON'T_ want you to spool 10 megabytes of video
> data through a metered connection).
>
> The workaround is to right click disable wifi on the networkmangler icon
> until it stops trying, usually about 15 seconds. One time out of a
> thousand it gets Confused and your network goes away and CANNOT BE FIXED
> until you reboot (ok, I've killed it, done the dbus status flush thing,
> and respawned it from the command line successfully twice, and each time
> was like half an hour of research _how_). But it was written by the same
> guys who did pulseaudio and systemd so you can't expect reliability out
> of it.) At least these days chrome doesn't listen to the "network is
> down" dbus notification and then refuse to show you pages from the web
> server running on loopback.
>
> (Did I mention I break everything? Seriously. People kept trying to pull
> me into a tester role for the first decade of my career... and the
> combination of sleep deprivation and caffeine makes me REALLY CHATTY and
> the airport shuttle arrived at 4:20am and last week I bought a 50 pack
> of "driving chocolate" squares that are 150mg of caffeine each and
> packed ALL OF THEM. Minus the ones I already ate.)

see... i knew there was a reason you read the bug report backwards :-)

> Anyway, what I want here is something with threads to test against,
> and both chrome and thunderbird have those, so....
>
>> and here's a Chrome engineer's "it's complicated, and everyone
>> misunderstands" post:
>> https://plus.google.com/+PeterKasting/posts/TC4ACtKevJY
>
> I follow "Security Princess" on twitter (probably @laparisa? See "no net
> right now" above). She blogs about this stuff from time to time. But I
> copied that link into a tab for when I get to Akihabara.
>
> Rob
>
> P.S. 9 open reply windows to deal with before I can close thunderbird! Woo!

-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.