[Toybox] memcpy overlap in ps

Wed Oct 5 03:15:57 PDT 2016

On 10/03/2016 07:22 PM, enh wrote:
> from the AOSP gerrit (fixing internal bug 30074257). i've been meaning
> to look at this more closely for a couple of months, but haven't found
> the time. i too wasn't sure whether switching to memmove was actually
> the fix or just papering over a real problem...

Since it took me a little to work through it too, and you're "not sure",
here's a refresher on the logic. (Feel free to skip down to the =======
where I start talking about the code directly setting up the memcpy this
patch would change.)

The code this patch touches (line numbers are circa commit 703c49e0cb97)
initializes the miscellaneous string slots, which store data we didn't
get out of /proc/$PID/stat but came from somewhere else under /proc/$PID.

This code lives in the function get_ps(). The miscellanous string
processing depends on an array of structures called fetch[] at the start
of get_ps() (line 601) listing various files under the current
/proc/$PID, each with a set of field bits that tell us when we need this
data (so we don't open and read from files our -o BLAH,BLAH,BLAH list
isn't displaying). Down around line 728 we iterate through the fetch[]
array, opening and reading each file and doing the additional processing
it requires. Each of these files produces variable length string data,
which we append to tb->str and store an array of tb->offset[] values to
where each one starts. (Well, after the first; the first is at offset
zero so we don't need to store it.)

All this code lives in function get_ps(), which reads all the /proc data
for a given process into "toybuf" in one gulp, and then at the end of
the function (line 850) either displays it immediately (if we haven't
got -k) or copies into into a malloced buffer attached tree we later
turn into an array and sort before displaying it. Either way, we filter
out the nodes we don't want (line 693) and return about halfway through
get_ps() if it's not a process we eventually want to display.
This means that all the data we read about each process, including the
miscelaneous string slots, has to collectively fit in 4096 bytes.

The variable name "tb" is short for "toybuf" bcause it's the structure I
was was overlaying on toybuf to store the data in, and at the top of the
function (line 610) it's pointed to toybuf (and on line 621 toybuf is
memset back to all zeroes), so that's what all its fields are scribbling
over as we read in data: "struct carveup *tb" is how we carve up toybuf
into individual fields.

Before we read the miscelaneous string data, lines 621 through 726 read
a bunch of other stuff into toybuf first (starting with all the numeric
data from the file "stat" going into the long long tb->slot[SLOT_count]
array, which is a TAGGED_ARRAY as described in the checkin comment of
commit f96bb3d8e7ec). Then we read other stuff out of the files
"status", "io", "statm", "exe", and android's get_sched_policy().

By the time we start iterating through that fetch[] array to initialize
the miscelanous string slots, we've used maybe 1/4 of toybuf's 4096
bytes, and a lot of these strings can be arbitrarily long (on modern
kernels argv[] data can be gigabytes), so we give each one a size budget
(see line 740). The theory is each field we HAVEN'T read yet has a
minimum reserved size (256 bytes), but any space that wasn't used by
fields we've already read gets added to the current field's budget, so
if one of them needs to be 2k it can without totally squeezing out the
others. (This gives us more space for string matching or displaying in
-w mode.)

(Note: there's a design tension here: processes can go away at any time,
they can exit _while_ we're reading them, so we snapshot the data for
each process all at once (and if we fail halfway through we discard the
process; if it had gone away a fraction of a second earlier we wouldn't
have seen it at all). The only AFTER we've read everything do we sort
and/or display it. We never go back and read more data from a process
after get_ps() returns, to avoid race conditions. But we don't want
unlimited memory usage (imagine a process with 1000 threads and a
megabyte-long command line), hence the 4k cap per process on data we
store. Alas, this means if your /proc/self/exe symlink is 3000 bytes
long, it's trucated to the space budget we had for reading that
miscellaneous string slot.)

The "buf" pointer advances through toybuf as we use it, and it always
points to the next unused byte in toybuf. So on line 742 we set "len" to
how much space this miscellaneous string field is allowed to use, and
use it as temporary scratch space to write the path to the filename we
want to look at into buf. ("fd" is an open filehandle to /proc so we can
openat() or readlinkat() paths relative to that. The upcoming read or
readlink command will overwrite the filename data with the data it fetches.)

Then we special case the fetch[] entries were we DON'T just open a file
and read all its contents, but instead do something like readlink() it,
and that's slots 0, 3 and 5. For slot 3 we're doing a
readlink("/proc/$PID/exe"), slot 0 does funky tty processing (most of
which Elliott wrote, doing a stat() on fd/* and comparing the major and
minor numbers with /proc/tty/drivers, which should possibly be cached so
we're not reopening it so much; but the reason it's in miscellaneous
string processing is the result is a string of unknown length that we
need to store until we're ready to display it).

Slot 5 is a funny one having to do with threads. A thread has to look at
its parent process in order to get some of its data (its own stat entry
has some blank fields), but the display logic is only ever looking at
ONE process's stored block of "struct carveup" toybuf data. By the time
the need for one process to look at ANOTHER process to fill out its tb
fields came up, the ps design really really didn't want to do that.

So what I did was add a global variable (TT.threadparent, see line 751)
that lets get_ps() access the parent of the thread its currently looking
at. (And know that when this variable is NULL, then it's not looking at
a thread.) Only the data reading function has this, the display function
can't because we only store the processes we're displaying, the filter
function may have discarded that other process before the display
function runs. If we're not showing the thread parent, we can't access
its data at display time. So for this extra thread data, we have to copy
fields from one process into another in the reading function, and that's
what "ptb" is for: parent tb.

If you're trying to make sense of this logic, sometimes to figure out
what should be IN these fields it helps to look at the consumer, I.E.
the display logic. At display time, the function string_field() (which
converts a field to a string) uses the typos[] array (see line 337, and
that TAGGED_ARRAY git commit mentioned earlier) to match each -o THINGY
field in sequence to figure out what we're displaying (lines 1178 to
1193 load the TT.fields list). Then there's a big if/else staircase
providing additional behavior to them (starting on line 426). The "slot"
field in each typos[] entry can be negative, which means display one of
the miscellaneous strings.

This means fetch[5] becomes -7 in typos[x].slot, first because you can't
have -0 so -1 means fetch[0], and second because the first entry is at
offset 0 into the tb->str data so -1 doesn't need a slot to record the
constant 0 -- that's what the comment on line 443 is talking about-- and
is loaded outside the miscellaneous string loop (it's the "real command
name" we read in from "stat" when we were initializing the slot[] array,
see line 643), so in this case fetch[5] is actually -7, which is the
value used in the -o "NAME" entry in typos (line 349). This also matches
the fetch bits entry on line 608. According to the help text on line 96,
NAME is "Process name (argv[0] of $PID)", meaning if we're a thread it's
our parent's argv[0], and if not it's ours. When we copy it we're
stripping the path components off (line 761) and only keeping the
filename part.

========= back to the action

So what we're doing around line 758 is copying everything after the last
'/' from fetch[4] (which is "cmdline" on line 607) into fetch[5], and
those can't overlap even if ptb = tb because tb->offset[5] is after the
end of tb->offset[4]. (Note, the "256 bytes of reserved space" is not
just a nice round number, it's also the maximum size a filename can
have, with null terminator, as enforced by the VFS.)

So the thread stuff we're doing for fetch[5] is fine, but fetch[3] also
has the "ptb vs tb" check with the possible need to copy data out of the
parent. We only readlinkat0() (which is just a readlinkat() that's
guaranteed to null terminate) when ptb is  null, otherwise we copy our
parent's data (because the thread hasn't got a file to readlink).

But when j == 3, we only go into the else case with the memcpy() when
ptb _isn't_ null, in which case s = ptb->str + offset and that's a
pointer to our parent's struct carveup tb memory block. So the memcpy is
copying from a malloc to toybuf, which can't overlap.

The other potential hiccup is SLOT_argv0len, ala:

          if (!ptb || tb->slot[SLOT_argv0len]) ptb = tb;
          i = ptb->slot[SLOT_argv0len];
          s = ptb->str+ptb->offset[4];

Could we be using an improperly initialized argv0len? When ptb is NULL
but this thread's SLOT_argv0len isn't (which means it's not a thread but
a process), then we grab "cmdline" and use the argv0len set on line 842
(by the previous iteration of this loop when j==4). But since
fetch[4].bits has _PS_NAME in it (line 607), then when we're processing
fetch[5] we _already_ processed fetch[4] for the same process (I.E. we
didn't skip it) and thus it set SLOT_argv0len properly for us. (It's the
last entry in the list with default processing, so as the comment on
line 841 says it sets the argv0len value that stays. Of course we have
to be _after_ it in the list to use the value it set when we're _not_
operating on a thread parent but instead copying from ourselves, which
is why we're fetch[5].

What argv0len _not_ being 0 says is /proc/self/cmdline existed and
wasn't empty, in which case we want to use its value.

So I don't see the codepath the address sanitizer's complaining about
where we copy over ourselves? (I thought it was because we were
potentially doing an in-place basename where we might copy the filename
after the / over the earlier path components, and that might overlap.
But the source is ptb->str plus either ptb->offset[3] or ptb->offset[4],
and even when ptb _is_ pointing to tb both of those should lower than
ptb->offset[5] and the length (i) should end before the start of
ptb->offset[5] (either because the offset[3] strlen() finds the null
terminated string or because we used slot[SLOT_argv0len] which should be
less than or equal to the strlen of the "cmdline" data in fetch[4]. (And
no it's not us failing to account for the null terminator, line 846 adds
the +1 when advancing buf for that reason.)

So if anybody can tell me what the address sanitizer is complaining
about, I'm all ears? That said, I can still replace the memcpy() with
memmove() to avoid triggering what _seems_ like a false positive in your
address sanitizer, if you like...

> Evgenii Stepanov has uploaded a new change for review.
> 
> ( https://googleplex-android-review.git.corp.google.com/1504922 )

Which wants a google corporate login.

Rob