Discussion:
ps crashes
(too old to reply)
enh
2017-03-09 18:31:49 UTC
Permalink
Raw Message
i could have sworn i mentioned this already, but i can't find any
proof that i did.

i haven't seen this crash personally, but automated testing has seen
it a few times now:


pid: 25863, tid: 25863, name: ps >>> ps <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x11
x0 0000000000000001 x1 000000557cc90468 x2 0000007fd2a2b1c8
x3 0000000000000100
x4 000000557cc9dd2c x5 000000557ccbc7e0 x6 000000007361742f
x7 000000006b736174
x8 0000000000000001 x9 0000000000000001 x10 0000007193efca80
x11 0000000000000002
x12 0000000000000005 x13 0000007194400994 x14 0000007194400638
x15 0000000000000000
x16 000000719441e208 x17 00000071943751c0 x18 0000000000000000
x19 0000007193e350c0
x20 0000000000005820 x21 000000557ccbc7d0 x22 000000557ccb97a8
x23 0000000000000412
x24 0000000000000040 x25 0000007193e35010 x26 0000000000000000
x27 0000000000000000
x28 0000000000000000 x29 0000007fd2a2b2e0 x30 000000557cc90394
sp 0000007fd2a2b2b0 pc 000000557cc903bc pstate 0000000060000000

Stack Trace:
RELADDR FUNCTION FILE:LINE
00000000000353bc get_threads+164
external/toybox/toys/posix/ps.c:905 (discriminator 1)
000000000000dabc dirtree_handle_callback+36
external/toybox/lib/dirtree.c:112
000000000000dbc8 dirtree_recurse+128
external/toybox/lib/dirtree.c:156
000000000000db14 dirtree_handle_callback+124
external/toybox/lib/dirtree.c:115
00000000000347a0 ps_main+952
external/toybox/toys/posix/ps.c:1235
0000000000013814 toy_exec+92
external/toybox/main.c:166 (discriminator 1)
00000000000133d8 toybox_main+48
external/toybox/main.c:179 (discriminator 1)
00000000000138dc main+124 external/toybox/main.c:237



given the 0x11, i'm assuming this is actually 0x10 (aka 16 aka
sizeof(void*)/2 aka struct dirtree::child) off DIRTREE_ABORTVAL.

but i still don't see how we end up with DIRTREE_ABORTVAL here, so i'm
not sure what the right fix is.

any ideas? (i'm assuming it's fallout from DIRTREE_SHUTUP, but haven't
worked out how so yet.)
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2017-03-11 11:48:28 UTC
Permalink
Raw Message
Back in Austin, recovering from jetlag...
Post by enh
i could have sworn i mentioned this already, but i can't find any
proof that i did.
i haven't seen this crash personally, but automated testing has seen
pid: 25863, tid: 25863, name: ps >>> ps <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x11
Huh, that's not good.
Post by enh
x0 0000000000000001 x1 000000557cc90468 x2 0000007fd2a2b1c8
Guessing from register size/count this is arm64?
Post by enh
RELADDR FUNCTION FILE:LINE
00000000000353bc get_threads+164
So +164 would be about... No idea, but presumably not right at the start
of the function? Which rules out new being funky...
Post by enh
external/toybox/toys/posix/ps.c:905 (discriminator 1)
000000000000dabc dirtree_handle_callback+36
external/toybox/lib/dirtree.c:112
000000000000dbc8 dirtree_recurse+128
external/toybox/lib/dirtree.c:156
000000000000db14 dirtree_handle_callback+124
external/toybox/lib/dirtree.c:115
00000000000347a0 ps_main+952
external/toybox/toys/posix/ps.c:1235
0000000000013814 toy_exec+92
external/toybox/main.c:166 (discriminator 1)
00000000000133d8 toybox_main+48
external/toybox/main.c:179 (discriminator 1)
00000000000138dc main+124 external/toybox/main.c:237
given the 0x11, i'm assuming this is actually 0x10 (aka 16 aka
sizeof(void*)/2 aka struct dirtree::child) off DIRTREE_ABORTVAL.
Ok, I'm wrong, sounds like arm64 takes 164 bytes to clear its throat,
because that sounds like:

static int get_threads(struct dirtree *new)
{
struct dirtree *dt;
struct carveup *tb;
unsigned pid, kcount;

if (!new->parent) return get_ps(new);
pid = atol(new->name);

should be faulting on the very first line (where we have a read
dereferencing a null-ish pointer).
Post by enh
but i still don't see how we end up with DIRTREE_ABORTVAL here, so i'm
not sure what the right fix is.
ABORTVAL is what you return when you want the recursion to stop, it
means you found what you're looking for. Why it would be passed _into_
the callback is a question: that shouldn't happen.
Post by enh
any ideas? (i'm assuming it's fallout from DIRTREE_SHUTUP, but haven't
worked out how so yet.)
It sounds like /proc isn't mounted? Comment in lib/ditree.c right above
dirtree_flagread() says:

// Returns DIRTREE_ABORTVAL if path didn't exist (use DIRTREE_SHUTUP
// to handle error message yourself).

Which is because handle_callback() returns DIRTREE_ABORTVAL if it's
passed a null pointer for new. And our various calls to
dirtree_flagread() in ps.c seem to assume NULL as the error return,
which isn't right.

Except... the only use of get_threads() in ps is line 1237 where we feed
it as an argument to dirtree_flagread() and that _should_ be getting
this right. The callback shouldn't get called back with ABORTVAL in the
first place? It's filtered out in dirtree.c line 157, which is the only
caller of handle_callback in the file other than dirtree_flagread()...

I need to look at this when I'm more awake. And yeah, that got fiddled
with in DIRTREE_SHUTUP. Hmmm, how would I reproduce this... I just tried
a chroot with /proc not mounted (and again with no /proc directory) and
ps just gave me the header line with no contents, no segfault...

Rob
enh
2017-03-11 19:22:17 UTC
Permalink
Raw Message
Post by Rob Landley
Back in Austin, recovering from jetlag...
Post by enh
i could have sworn i mentioned this already, but i can't find any
proof that i did.
i haven't seen this crash personally, but automated testing has seen
pid: 25863, tid: 25863, name: ps >>> ps <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x11
Huh, that's not good.
Post by enh
x0 0000000000000001 x1 000000557cc90468 x2 0000007fd2a2b1c8
Guessing from register size/count this is arm64?
Post by enh
RELADDR FUNCTION FILE:LINE
00000000000353bc get_threads+164
So +164 would be about... No idea, but presumably not right at the start
of the function? Which rules out new being funky...
Post by enh
external/toybox/toys/posix/ps.c:905 (discriminator 1)
000000000000dabc dirtree_handle_callback+36
external/toybox/lib/dirtree.c:112
000000000000dbc8 dirtree_recurse+128
external/toybox/lib/dirtree.c:156
000000000000db14 dirtree_handle_callback+124
external/toybox/lib/dirtree.c:115
00000000000347a0 ps_main+952
external/toybox/toys/posix/ps.c:1235
0000000000013814 toy_exec+92
external/toybox/main.c:166 (discriminator 1)
00000000000133d8 toybox_main+48
external/toybox/main.c:179 (discriminator 1)
00000000000138dc main+124 external/toybox/main.c:237
given the 0x11, i'm assuming this is actually 0x10 (aka 16 aka
sizeof(void*)/2 aka struct dirtree::child) off DIRTREE_ABORTVAL.
Ok, I'm wrong, sounds like arm64 takes 164 bytes to clear its throat,
static int get_threads(struct dirtree *new)
{
struct dirtree *dt;
struct carveup *tb;
unsigned pid, kcount;
if (!new->parent) return get_ps(new);
pid = atol(new->name);
should be faulting on the very first line (where we have a read
dereferencing a null-ish pointer).
from the line number addr2line gave above, it's the line marked with !
below. i've included the block between too, since that modifies
new->child.

// Recurse down into tasks, retaining thread groups.
// Disable show_process at least until we can calculate tcount
kcount = TT.kcount;
sprintf(toybuf, "/proc/%u/task", pid);
new->child = dirtree_flagread(toybuf, DIRTREE_SHUTUP|DIRTREE_PROC, get_ps);
TT.threadparent = 0;
kcount = TT.kcount-kcount+1;
tb = (void *)new->extra;
tb->slot[SLOT_tcount] = kcount;

// Fill out tid and thread count for each entry in group
! if (new->child) for (dt = new->child->child; dt; dt = dt->next) {

so i assume it's that dirtree_flagread that's failing, returning
DIRTREE_ABORTVAL, and since the code that follows just assumes it's
either valid or null, we die on the line marked ! (as claimed by the
stack trace).

i don't know is why why dirtree_flagread is failing, though, and can't
reproduce it.
Post by Rob Landley
Post by enh
but i still don't see how we end up with DIRTREE_ABORTVAL here, so i'm
not sure what the right fix is.
ABORTVAL is what you return when you want the recursion to stop, it
means you found what you're looking for. Why it would be passed _into_
the callback is a question: that shouldn't happen.
Post by enh
any ideas? (i'm assuming it's fallout from DIRTREE_SHUTUP, but haven't
worked out how so yet.)
It sounds like /proc isn't mounted? Comment in lib/ditree.c right above
// Returns DIRTREE_ABORTVAL if path didn't exist (use DIRTREE_SHUTUP
// to handle error message yourself).
Which is because handle_callback() returns DIRTREE_ABORTVAL if it's
passed a null pointer for new. And our various calls to
dirtree_flagread() in ps.c seem to assume NULL as the error return,
which isn't right.
Except... the only use of get_threads() in ps is line 1237 where we feed
it as an argument to dirtree_flagread() and that _should_ be getting
this right. The callback shouldn't get called back with ABORTVAL in the
first place? It's filtered out in dirtree.c line 157, which is the only
caller of handle_callback in the file other than dirtree_flagread()...
I need to look at this when I'm more awake. And yeah, that got fiddled
with in DIRTREE_SHUTUP. Hmmm, how would I reproduce this... I just tried
a chroot with /proc not mounted (and again with no /proc directory) and
ps just gave me the header line with no contents, no segfault...
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2017-03-12 09:04:35 UTC
Permalink
Raw Message
Post by enh
from the line number addr2line gave above, it's the line marked with !
below. i've included the block between too, since that modifies
new->child.
// Recurse down into tasks, retaining thread groups.
// Disable show_process at least until we can calculate tcount
kcount = TT.kcount;
sprintf(toybuf, "/proc/%u/task", pid);
new->child = dirtree_flagread(toybuf, DIRTREE_SHUTUP|DIRTREE_PROC, get_ps);
TT.threadparent = 0;
kcount = TT.kcount-kcount+1;
tb = (void *)new->extra;
tb->slot[SLOT_tcount] = kcount;
// Fill out tid and thread count for each entry in group
! if (new->child) for (dt = new->child->child; dt; dt = dt->next) {
so i assume it's that dirtree_flagread that's failing, returning
DIRTREE_ABORTVAL, and since the code that follows just assumes it's
either valid or null, we die on the line marked ! (as claimed by the
stack trace).
i don't know is why why dirtree_flagread is failing, though, and can't
reproduce it.
Oh that's easy to see from context: /proc/%u exited before /proc/$u/task
could be opened, so dirtree_flagread() went "nope" and we didn't check
for it.

Asynchronous process exit while we're reading its data is something I
tried to be robust against in the rest of ps, usually by effectively
moving the process exit a milisecond or so earlier. If we couldn't read
the data because process go bye-bye how does that differ from process go
bye-bye an instant before we would have noticed it?

Looks like I messed up here going back and adding thread support because
the dirtree_flagread() semantics are non-obvious. (The
dirtree_flagread() reopen in thread scanning is a rough edge in the
design. I'm splicing together discontiguous trees manually and the
result doesn't have the checks and review the existing dirtree.c stuff
does.)

And that's my real concern here: the dirtree_flagread() semantics
_shouldn't_ be non-obvious. Should I change dirtree_flagread() to always
return NULL or valid? Why did I have it NOT do that in the first place
(to the point I commented it not doing that). I vaguely recall there's
some error condition that the caller can only distinguish if it gets the
extra error return to distinguish the case, but I don't remember _what_
the condition is. (Exists but permissions wrong to actually read its
contents, maybe?)

Alas my toybox work directory's in a bit of a shambles at the moment
because I've got half-finished make.sh changes for getconf.c and
half-finished help text parsing changes, and between the two of 'em it
doesn't _build_ at the moment. (My two weeks in tokyo and the week in
portland before that I was too busy with other stuff to close any tabs.
Happy to be home, but torn between resting and shoveling out...)

I can do a quick fix here (just add a test for abortval on that line)
but I'd rather change flagread not to return abortval, which means
understanding what I lose by doing that and being comfortable with it.
(And why the _other_ calls to flagread aren't dying more regularly with
abortval returns? How is this NOT screwing up ps with no /proc ?)

But once again, I'm only getting to this at 4am. (Ok, mostly that's
jetlag royally horking my schedule.) Might not get to this tomorrow if
getconf build infrastructure continues to be stroppy, but it's #3 on my
toybox todo list right now.

Thanks,

Rob
enh
2017-03-22 18:18:25 UTC
Permalink
Raw Message
(For those following along at home, a fix was merged yesterday.)
Post by enh
Post by enh
from the line number addr2line gave above, it's the line marked with !
below. i've included the block between too, since that modifies
new->child.
// Recurse down into tasks, retaining thread groups.
// Disable show_process at least until we can calculate tcount
kcount = TT.kcount;
sprintf(toybuf, "/proc/%u/task", pid);
new->child = dirtree_flagread(toybuf, DIRTREE_SHUTUP|DIRTREE_PROC,
get_ps);
Post by enh
TT.threadparent = 0;
kcount = TT.kcount-kcount+1;
tb = (void *)new->extra;
tb->slot[SLOT_tcount] = kcount;
// Fill out tid and thread count for each entry in group
! if (new->child) for (dt = new->child->child; dt; dt = dt->next) {
so i assume it's that dirtree_flagread that's failing, returning
DIRTREE_ABORTVAL, and since the code that follows just assumes it's
either valid or null, we die on the line marked ! (as claimed by the
stack trace).
i don't know is why why dirtree_flagread is failing, though, and can't
reproduce it.
Oh that's easy to see from context: /proc/%u exited before /proc/$u/task
could be opened, so dirtree_flagread() went "nope" and we didn't check
for it.
Asynchronous process exit while we're reading its data is something I
tried to be robust against in the rest of ps, usually by effectively
moving the process exit a milisecond or so earlier. If we couldn't read
the data because process go bye-bye how does that differ from process go
bye-bye an instant before we would have noticed it?
Looks like I messed up here going back and adding thread support because
the dirtree_flagread() semantics are non-obvious. (The
dirtree_flagread() reopen in thread scanning is a rough edge in the
design. I'm splicing together discontiguous trees manually and the
result doesn't have the checks and review the existing dirtree.c stuff
does.)
And that's my real concern here: the dirtree_flagread() semantics
_shouldn't_ be non-obvious. Should I change dirtree_flagread() to always
return NULL or valid? Why did I have it NOT do that in the first place
(to the point I commented it not doing that). I vaguely recall there's
some error condition that the caller can only distinguish if it gets the
extra error return to distinguish the case, but I don't remember _what_
the condition is. (Exists but permissions wrong to actually read its
contents, maybe?)
Alas my toybox work directory's in a bit of a shambles at the moment
because I've got half-finished make.sh changes for getconf.c and
half-finished help text parsing changes, and between the two of 'em it
doesn't _build_ at the moment. (My two weeks in tokyo and the week in
portland before that I was too busy with other stuff to close any tabs.
Happy to be home, but torn between resting and shoveling out...)
I can do a quick fix here (just add a test for abortval on that line)
but I'd rather change flagread not to return abortval, which means
understanding what I lose by doing that and being comfortable with it.
(And why the _other_ calls to flagread aren't dying more regularly with
abortval returns? How is this NOT screwing up ps with no /proc ?)
But once again, I'm only getting to this at 4am. (Ok, mostly that's
jetlag royally horking my schedule.) Might not get to this tomorrow if
getconf build infrastructure continues to be stroppy, but it's #3 on my
toybox todo list right now.
Thanks,
Rob
Rob Landley
2017-03-22 19:37:29 UTC
Permalink
Raw Message
Post by enh
(For those following along at home, a fix was merged yesterday.)
And the reason I didn't change what lib/dirtree.c was doing is stuff
like "mv" is making use of the distinction. :P

Rob
enh
2017-06-15 21:27:54 UTC
Permalink
Raw Message
a new report today (from a build from yesterday that does have the `if
(new->child == DIRTREE_ABORTVAL) new->child = 0;` patch):

pid: 16725, tid: 16725, name: ps >>> ps <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x10
x0 0000000000000000 x1 0000000000000000 x2 0000000000000000
x3 0000007fc05036e8
x4 0000007ccc157174 x5 0000007ccc01d08b x6 7365636f72702e64
x7 616964656d2e7373
x8 0000000000000000 x9 0000000000000001 x10 0000000000004001
x11 0000000000000000
x12 000000000000000f x13 0000007ccc790c94 x14 0000007ccc790938
x15 0000000000000001
x16 0000007ccc7ae2c0 x17 0000007ccc74d80c x18 000000000000000d
x19 0000007ccc0350c0
x20 0000000000000000 x21 000000556e5bf790 x22 000000556e5bc768
x23 0000000000000654
x24 0000000000000040 x25 0000007ccc035010 x26 0000000000000000
x27 0000000000000000
x28 0000000000000000 x29 0000007fc0503ee0 x30 000000556e593470
sp 0000007fc0503eb0 pc 000000556e593478 pstate 0000000080000000

Stack Trace:
RELADDR FUNCTION FILE:LINE
0000000000034478 get_threads+228
external/toybox/toys/posix/ps.c:917 (discriminator 1)
000000000000da2c dirtree_handle_callback+36
external/toybox/lib/dirtree.c:112
000000000000db38 dirtree_recurse+128
external/toybox/lib/dirtree.c:156
000000000000da84 dirtree_handle_callback+124
external/toybox/lib/dirtree.c:115
0000000000033808 ps_main+952
external/toybox/toys/posix/ps.c:1238
0000000000013770 toy_exec+92
external/toybox/main.c:166 (discriminator 1)
0000000000013334 toybox_main+48
external/toybox/main.c:179 (discriminator 1)
0000000000013838 main+124 external/toybox/main.c:237

as before, this was just a crash collected during testing. no repro
steps and (afaik) this is the first crash since March :-/
Post by Rob Landley
Post by enh
(For those following along at home, a fix was merged yesterday.)
And the reason I didn't change what lib/dirtree.c was doing is stuff
like "mv" is making use of the distinction. :P
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2017-06-15 21:51:22 UTC
Permalink
Raw Message
Post by enh
a new report today (from a build from yesterday that does have the `if
Sigh.
Post by enh
pid: 16725, tid: 16725, name: ps >>> ps <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x10
x0 0000000000000000 x1 0000000000000000 x2 0000000000000000
x3 0000007fc05036e8
x4 0000007ccc157174 x5 0000007ccc01d08b x6 7365636f72702e64
x7 616964656d2e7373
x8 0000000000000000 x9 0000000000000001 x10 0000000000004001
x11 0000000000000000
x12 000000000000000f x13 0000007ccc790c94 x14 0000007ccc790938
x15 0000000000000001
x16 0000007ccc7ae2c0 x17 0000007ccc74d80c x18 000000000000000d
x19 0000007ccc0350c0
x20 0000000000000000 x21 000000556e5bf790 x22 000000556e5bc768
x23 0000000000000654
x24 0000000000000040 x25 0000007ccc035010 x26 0000000000000000
x27 0000000000000000
x28 0000000000000000 x29 0000007fc0503ee0 x30 000000556e593470
sp 0000007fc0503eb0 pc 000000556e593478 pstate 0000000080000000
RELADDR FUNCTION FILE:LINE
0000000000034478 get_threads+228
external/toybox/toys/posix/ps.c:917 (discriminator 1)
Line 917 is while (dt->child)... any chance _that_ got set to
DIRTREE_ABORTVAL?

Grrr.

The problem with the help plumbing is is collates hunks with matching
"usage: cmdname" lines and the common blocks in ps are trying to do
"usage: *" and have the "depends on" info specify which block to
collate, but config2help.c isn't actually using the depends on info.
(It's collecting it and putting it in the structures, but nothing's
using it.)

So I switched it the ps hunks to say "usage: top|iotop" and taught the
plumbing to iterate through those and match the currently active name
against any of them.

The problem is it's iterating through the hunks BACKWARDS (and the
collating code is depending on that to get the text chunks in the right
order), so grabs "top|iotop" as the name to _match_ so the extra
iteration I added is in the wrong place.

Working on it. Lemme finish the help thing, cut the release, then I have
pending pings from Josh Gao and Ilya Kuzmich, and then this is next on
the list.

(Mostly through the cold though. that was an unpleasant couple of weeks.)

Rob

Loading...