Discussion:
[PATCH] Fix the operator precedence in expr.
(too old to reply)
Andy Chu
2016-03-15 04:29:32 UTC
Permalink
This fixes 3 tests, and the delta in expr.c is +11 lines -- all
comments. The actual code is shorter.

It's a bit unnatural to me but I tried to compress the code, following
the existing style.

I'm going to fix the type coercion problems too, which will piggyback
on this change. Basically the OPS lookup table needs have another
field to specify coercion rules for the operation. The type
conversion has to be done with respect to the operation being
evaluated rather than globally up front.

# fails with type error, '21' should be coerced to integer and then added to 3
expr ab21xx : '[^0-9]*\([0-9]*\)' + 3

# segfaults because '3' is coerced to integer at parse time, and we
pass NULL to regexec
expr 3 : '\(.\)'

-----

expr now uses the precedence table specified by POSIX, implemented using
the "precedence climbing" algorithm. See the references at the top of
eval_expr().

This fixes 3 of 4 failing tests.

I also added more tests for correct behavior and for syntax errors.
This includes a new test exposing a segfault, related to type coercion.
Andy Chu
2016-03-15 21:15:28 UTC
Permalink
Here is a second patch on top of the first one. Between these two
patches, it's pretty much a complete rewrite -- different algorithm,
data structures, and code structure.

It's back at 276 lines, which I think compares pretty favorably to
busybox at 538 lines:

http://code.metager.de/source/xref/busybox/coreutils/expr.c

One thing is that I don't quite understand the memory management
strategy. I don't know exactly what CFG_TOYBOX_FREE is for. The code
already had two issues:

- xmprintf in re() leaks I think
- in num_to_str, the same static buffer was used for multiple strings,
which is a latent bug because you could strcmp() the same buffer
against itself

I changed it to use xmalloc, which will cause a memory leak too. I
guess if I want to clean up properly, I should keep a list of pointers
in GLOBALS and free them all at the end in expr_main() ?

BTW this 0001- patch is more up to date than the one in the previous
e-mail (I used my @google.com e-mail address because my employer wants
that -- it's stricter now due to all the lawsuits...)

Andy

-----

[PATCH 2/2] Fix type coercion bugs in expr.

All tests pass now; this fixes the 2 remaining failures, including a
segfault.

The structure of the code has changed a lot -- instead of having a tiny
function per operator, we have eval_op() which does common type coercion
and then evaluates the operator. I tried writing it a couple different
ways, and this was the cleanest.

The OPS table now contains the operator string, precedence level,
signature for type coercion, and operator ID.
Post by Andy Chu
This fixes 3 tests, and the delta in expr.c is +11 lines -- all
comments. The actual code is shorter.
It's a bit unnatural to me but I tried to compress the code, following
the existing style.
I'm going to fix the type coercion problems too, which will piggyback
on this change. Basically the OPS lookup table needs have another
field to specify coercion rules for the operation. The type
conversion has to be done with respect to the operation being
evaluated rather than globally up front.
# fails with type error, '21' should be coerced to integer and then added to 3
expr ab21xx : '[^0-9]*\([0-9]*\)' + 3
# segfaults because '3' is coerced to integer at parse time, and we
pass NULL to regexec
expr 3 : '\(.\)'
-----
expr now uses the precedence table specified by POSIX, implemented using
the "precedence climbing" algorithm. See the references at the top of
eval_expr().
This fixes 3 of 4 failing tests.
I also added more tests for correct behavior and for syntax errors.
This includes a new test exposing a segfault, related to type coercion.
Rob Landley
2016-03-15 21:26:48 UTC
Permalink
Post by Andy Chu
Here is a second patch on top of the first one. Between these two
patches, it's pretty much a complete rewrite -- different algorithm,
data structures, and code structure.
It's back at 276 lines, which I think compares pretty favorably to
http://code.metager.de/source/xref/busybox/coreutils/expr.c
One thing is that I don't quite understand the memory management
strategy. I don't know exactly what CFG_TOYBOX_FREE is for.
Every config symbol should have menuconfig help text. If it's
not clear what it's for from that, I need to fix the help text.

config TOYBOX_FREE
bool "Free memory unnecessarily"
default n
help
When a program exits, the operating system will clean up after it
(free memory, close files, etc). To save size, toybox usually relies
on this behavior. If you're running toybox under a debugger or
without a real OS (ala newlib+libgloss), enable this to make toybox
clean up after itself.
Things in pending are usually in pending for more than one reason. :)

(If it was an easy fix I'd already have done it...)
Post by Andy Chu
- xmprintf in re() leaks I think
- in num_to_str, the same static buffer was used for multiple strings,
which is a latent bug because you could strcmp() the same buffer
against itself
That's ok if the users track the allocations, but not having done
the cleanup on it I couldn't tell you.

(I wouldn't want lib code to do that without a BIG COMMENT, but
local to a command I assume the command author knows what they're
doing. At least when it's not in pending and I've _confirmed_ they
know what they're doing, at least well enough to fool me. :)
Post by Andy Chu
I changed it to use xmalloc, which will cause a memory leak too. I
guess if I want to clean up properly, I should keep a list of pointers
in GLOBALS and free them all at the end in expr_main() ?
If it's at the end of expr_main() we don't care. However, if leaks
per-input and we have unlimited input... (I believe the current linux
environment space limit is INT_MAX arguments each INT_MAX long, modulo
ulimit applying something to that maybe?)

This sounds per-argument, so I'd want to free it when it's done
with it if we can. (I need to read the code after applying your
patches.)
Post by Andy Chu
BTW this 0001- patch is more up to date than the one in the previous
that -- it's stricter now due to all the lawsuits...)
Works for me.

I'm currently squinting at 3 different FAT docs, but I'll try to
take a look at this after dinner.
Post by Andy Chu
Andy
Thanks,

Rob
Andy Chu
2016-03-15 21:46:36 UTC
Permalink
Post by Rob Landley
If it's at the end of expr_main() we don't care. However, if leaks
per-input and we have unlimited input... (I believe the current linux
environment space limit is INT_MAX arguments each INT_MAX long, modulo
ulimit applying something to that maybe?)
This sounds per-argument, so I'd want to free it when it's done
with it if we can. (I need to read the code after applying your
patches.)
OK great, some notes on memory management for when you review:

expr is a pure function of the argv, and the implementation almost
always uses the constant argv strings, which we don't need to manage.
The two exceptions are:

- re() -- when you do expr foo : 'f\(.*\)', the resulting captured
string "oo" has to go somewhere
- type coercion of integers: if you do expr 2 + 3 \< abcd -- 5 on the
LHS is converted to "5" by get_str(), called from eval_op().

So if you don't do any of those things, you won't get any leaks. In
the worst case, you could have one leak for every 3 or 4 argv entries
(you can't make it leak for every entry AFAICT). And the captured
string is smaller than the length of a single arg, which is limited to
131072 bytes on Linux I think.

http://www.in-ulm.de/~mascheck/various/argmax/

In practice I don't think it will matter because I don't think people
are really writing huge expressions with expr, because it will get
unreadable. It has no intermediate variables -- it would have to be a
single huge expression that does tons of regex captures or mixes types
a lot.

I was thinking to clean up in expr_main -- cleaning up more
aggressively would be a little tricky since get_str() only allocates
conditionally. But I guess you're saying that isn't worth it since
the process will be torn down.

And yes it definitely helps to have code in pending, so I'm all for
putting unfinished / untested stuff there! (Though what I'm surprised
by is that I started up ConnectBot on my Nexus 5X phone and was able
to tickle bugs which already have failing test cases! Like the
precedence bugs in expr.)

thanks,
Andy
Rob Landley
2016-03-16 07:49:09 UTC
Permalink
Post by Andy Chu
Post by Rob Landley
If it's at the end of expr_main() we don't care. However, if leaks
per-input and we have unlimited input... (I believe the current linux
environment space limit is INT_MAX arguments each INT_MAX long, modulo
ulimit applying something to that maybe?)
This sounds per-argument, so I'd want to free it when it's done
with it if we can. (I need to read the code after applying your
patches.)
expr is a pure function of the argv, and the implementation almost
always uses the constant argv strings, which we don't need to manage.
- re() -- when you do expr foo : 'f\(.*\)', the resulting captured
string "oo" has to go somewhere
- type coercion of integers: if you do expr 2 + 3 \< abcd -- 5 on the
LHS is converted to "5" by get_str(), called from eval_op().
So if you don't do any of those things, you won't get any leaks. In
the worst case, you could have one leak for every 3 or 4 argv entries
(you can't make it leak for every entry AFAICT). And the captured
string is smaller than the length of a single arg, which is limited to
131072 bytes on Linux I think.
http://www.in-ulm.de/~mascheck/various/argmax/
Yeah I thought so too until recently, but it turns out that's old
information. The 2.6.22 kernel (July 2007) switched it to INT_MAX many
arguments, each of which can 131072 bytes long:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b6a2fea39318

That said, there are such things as pilot error. However you do it,
resource exhaustion's a possibility we can't entirely avoid. Just try
not to be wasteful.
Post by Andy Chu
In practice I don't think it will matter because I don't think people
are really writing huge expressions with expr, because it will get
unreadable. It has no intermediate variables -- it would have to be a
single huge expression that does tons of regex captures or mixes types
a lot.
Eh, if the variables are still live, they can get cleaned up afterwards.
If the algorithm naturally treats (X+2)-(Y/3) by calculating X+2 first,
discarding those nodes when we're done with them makes sense. But if it
doesn't, and we can't, oh well.

Note: the environment space is laid out linearly by Linux (and the
startup code in libc's crt1.o depends on this, I believe there's a
standard but am not sure off the top of my head which one, possibly
ELF?), and lib/args.c may cherry-pick arguments but won't reorder them
(guaranteed!), so you can detect that an allocation isn't in optargs by
checking:

if (p<*toys.optargs || p>toys.optargs[toys.optc-1]) not_argv();

It would be nice to free this data ourselves because expr is close to
$((1+2)) in the shell, and it would be nice if the shell could just
internall call expr nofork in that case.

(If we can't do the cleanup, we could also xrun() it and pass the data
back via a pipe, but that's not ideal.)
Post by Andy Chu
I was thinking to clean up in expr_main -- cleaning up more
aggressively would be a little tricky since get_str() only allocates
conditionally. But I guess you're saying that isn't worth it since
the process will be torn down.
Not in the nofork case, and if the shell can recycle expr to implement
$(( )) then extra work to make it callable nofork is good.
Post by Andy Chu
And yes it definitely helps to have code in pending, so I'm all for
putting unfinished / untested stuff there! (Though what I'm surprised
by is that I started up ConnectBot on my Nexus 5X phone and was able
to tickle bugs which already have failing test cases! Like the
precedence bugs in expr.)
I often fix the test case before I fix the code, to remind me to fix the
code. :)

(I don't always check those in though, because "git diff" showing me a
comment or hunk of code is an easy way of keeping track of todo items.)

Alas it's not so obvious to people other than me which failing test
cases I added, and which were contributed that way.
Post by Andy Chu
thanks,
Andy
Rob
Andy Chu
2016-03-16 18:34:27 UTC
Permalink
Post by Rob Landley
if (p<*toys.optargs || p>toys.optargs[toys.optc-1]) not_argv();
It would be nice to free this data ourselves because expr is close to
$((1+2)) in the shell, and it would be nice if the shell could just
internall call expr nofork in that case.
I attached to a patch to track the strings allocated and then free
them at the end.

(Note that there is NOT a "node" per argument ('struct value' which
which contains an int or string). The nodes are on the stack and used
destructively in the style of a *= b, a += b, etc. What we are
talking about is the strings that the nodes (may) point to. These are
almost always argv strings which don't need to be freed, but in the 2
cases I mentioned, we allocate them.

So now this is fixed, and I tested it with ASAN (with LLVM 3.8 which
can be downloaded easily). It easily finds the memory leak before
(only the regex test cases where we allocate memory fail, not every
test case) and confirms they pass after the patch.

ASAN also found another existing memory leak in the code! (not
calling regfree()) When I'm triaging the tests I'll definitely see
which ones pass under ASAN too.

Andy
Rob Landley
2016-03-16 22:00:08 UTC
Permalink
Post by Andy Chu
Post by Rob Landley
if (p<*toys.optargs || p>toys.optargs[toys.optc-1]) not_argv();
It would be nice to free this data ourselves because expr is close to
$((1+2)) in the shell, and it would be nice if the shell could just
internall call expr nofork in that case.
I attached to a patch to track the strings allocated and then free
them at the end.
I've already started cleanup on this command, but I'll integrate this
with what I've got.
Post by Andy Chu
(Note that there is NOT a "node" per argument ('struct value' which
which contains an int or string). The nodes are on the stack and used
destructively in the style of a *= b, a += b, etc. What we are
talking about is the strings that the nodes (may) point to. These are
almost always argv strings which don't need to be freed, but in the 2
cases I mentioned, we allocate them.
I noticed that during review earlier today, but hadn't gotten to the
point of doing anything about that yet. :)
Post by Andy Chu
So now this is fixed, and I tested it with ASAN (with LLVM 3.8 which
can be downloaded easily). It easily finds the memory leak before
(only the regex test cases where we allocate memory fail, not every
test case) and confirms they pass after the patch.
ASAN also found another existing memory leak in the code! (not
calling regfree()) When I'm triaging the tests I'll definitely see
which ones pass under ASAN too.
Sounds like fun. I hope to have this done and promoted for the next
release. (Ballpark end of April.)
Post by Andy Chu
Andy
Rob
Andy Chu
2016-03-16 23:44:25 UTC
Permalink
Post by Rob Landley
I've already started cleanup on this command, but I'll integrate this
with what I've got.
Sounds like fun. I hope to have this done and promoted for the next
release. (Ballpark end of April.)
OK great! I'd like to see the diff for future reference -- I've
looked through your cleanup page. Perhaps you can apply at least the
first 2 patches verbatim (as long as there are no egregious errors),
and then your cleanup commits, so I can just look at the history?

I wanted to bring up the one TODO in the patches... syntax_error()
defaults to printing "syntax error", but the caller passes a more
detailed error message, which can be enabled with "if (1)".

If it were my choice I would always use the verbose error message and
get rid of the if. I added the option for the cryptic message because
I thought that was the style of the project -- a smaller vocabulary
and less space for constant strings (?).

But if I have to debug this code again, I much prefer each error
message to have a unique message, so you don't have to go into a
debugger to see what happened (and also for general usability). In
fact I should have probably tested each of the syntax_error() calls
like this:

testing "error: missing )" "expr \( 5 + 5 2>&1" "expr: Expected )" ...

To illustrate the advantages, I think you wouldn't have been mystified
in your blog post about this behavior if GNU expr had better error
messages:

$ expr + 1
1

$ expr - 1
expr: syntax error

$ expr +
expr: syntax error

$ expr -
-

I would have written this as:

$ expr - 1
expr: expected atom, got operator -

$ expr +
expr: escape requires operand

That would make it clearer that + is both a unary and binary operator,
while - is only a binary operator.

thanks,
Andy







Andy
Rob Landley
2016-03-17 00:27:41 UTC
Permalink
Post by Andy Chu
Post by Rob Landley
I've already started cleanup on this command, but I'll integrate this
with what I've got.
Sounds like fun. I hope to have this done and promoted for the next
release. (Ballpark end of April.)
OK great! I'd like to see the diff for future reference
First pass was pushed to https://github.com/landley/toybox around
lunchtime: https://github.com/landley/toybox/commit/1af3bad8b63b
Post by Andy Chu
-- I've
looked through your cleanup page. Perhaps you can apply at least the
first 2 patches verbatim (as long as there are no egregious errors),
and then your cleanup commits, so I can just look at the history?
I applied your two patches and I've done one round of cleanup on top of
them so far. Then I started replacing "TT.tok = toys.optargs++" with
directly incrementing TT.tok and I introduced a bug with parentheses
management and got myself confused and went to do something else for a
while and haven't gotten back to it yet.
Post by Andy Chu
I wanted to bring up the one TODO in the patches... syntax_error()
defaults to printing "syntax error", but the caller passes a more
detailed error message, which can be enabled with "if (1)".
This morning's cleanup pass replaced the calls to syntax_error() with
calls to error_exit(), making it always print the more detailed error
message.
Post by Andy Chu
If it were my choice I would always use the verbose error message and
get rid of the if. I added the option for the cryptic message because
I thought that was the style of the project -- a smaller vocabulary
and less space for constant strings (?).
It's not less space for constant strings, it's an intentionally
restricted vocabulary to be less hard on non-english speakers.

I don't have a direct jump-to#tag in
http://landley.net/toybox/design.html but it's "error messages and
internationalization". (And I probably should add an index with #tags to
that page...)
Post by Andy Chu
But if I have to debug this code again, I much prefer each error
message to have a unique message, so you don't have to go into a
debugger to see what happened (and also for general usability). In
fact I should have probably tested each of the syntax_error() calls
testing "error: missing )" "expr \( 5 + 5 2>&1" "expr: Expected )" ...
By the way, "expr ( )" printed "missing )", I have a better error
message in the current one but as I said: buggy at present.
Post by Andy Chu
To illustrate the advantages, I think you wouldn't have been mystified
in your blog post about this behavior if GNU expr had better error
$ expr + 1
1
$ expr - 1
expr: syntax error
$ expr +
expr: syntax error
$ expr -
-
$ expr - 1
expr: expected atom, got operator -
$ expr +
expr: escape requires operand
That would make it clearer that + is both a unary and binary operator,
while - is only a binary operator.
So my help text is wrong and not _all_ operators are infix.

Sigh...

(I did a cleanup pass earlier, and often with cleanup I start with the
help text. What SHOULD this command be doing? Documentation is _not_ an
afterthought...)

Rob
Andy Chu
2016-03-17 01:20:26 UTC
Permalink
Post by Rob Landley
I applied your two patches and I've done one round of cleanup on top of
them so far. Then I started replacing "TT.tok = toys.optargs++" with
directly incrementing TT.tok and I introduced a bug with parentheses
management and got myself confused and went to do something else for a
while and haven't gotten back to it yet.
Yeah it's a little fiddly, but elegant when you get used to the
algorithm's logic. To fix the misleading error message on expr \( \),
try this.

+ if (!strcmp(TT.tok, ")")) { // an atom can't start with )
+ error_exit("Unexpected )");
- if (!strcmp(TT.tok, "(")) { // parenthesized expression
+ } else if (!strcmp(TT.tok, "(")) { // parenthesized expression

The entry into eval_expr() expects an "atom", which is either
something like "1" or "a" or a parenthesized expression like ( 1 + 2
).

Note that expr \* is valid -- it treats it like a literal string. So
expr \) was ALSO valid.

expr \( \) \) was valid -- it is a string inside parentheses. Thus
expr \( \) was being parsed as OPEN PAREN, STRING, and then ERROR,
expected closing \). So it was technically correct, though very
confusing This tiny patch will disallow things like expr \) and expr
\( \) \).

I actually had this check in my original version, but removed it
because of minimalism, but didn't realize that confusing case.
Post by Rob Landley
Post by Andy Chu
I wanted to bring up the one TODO in the patches... syntax_error()
defaults to printing "syntax error", but the caller passes a more
detailed error message, which can be enabled with "if (1)".
This morning's cleanup pass replaced the calls to syntax_error() with
calls to error_exit(), making it always print the more detailed error
message.
OK awesome! Makes sense.
Post by Rob Landley
Post by Andy Chu
That would make it clearer that + is both a unary and binary operator,
while - is only a binary operator.
So my help text is wrong and not _all_ operators are infix.
Sigh...
No your help is correct! I didn't implement what GNU expr does (nor
did existing code). It "puns" + as a unary escaping operator (in
addition to infix addition).

I wrote about this in my message about find / expr / test.

It's because GNU has "match" / "substr" operators, which we don't
have. This is the problem:

$ expr foo = foo
1

$ expr match = match
expr: syntax error # it's expecting arguments to the operator "match"

$ expr + match = + match # escape match operator as string
1

I think the command is OK as is -- it's POSIX conformant... It's
possible we could run into something in Aboriginal LInux that uses the
match / substr operators, in which case we could add them and the
annoying + unary escape.

Andy
Andy Chu
2016-03-20 06:32:14 UTC
Permalink
Rob, not sure if you're done with this yet, but the strategy for
freeing here doesn't quite work:

https://github.com/landley/toybox/commit/0ec95b7c2e20cd7be33bae6adba20bf89c5f3e86

I thought about exactly this: just free in eval_op(). But there are a
few problems -- for example:

1) re() allocates a string for the match
2) You free it in eval_op() free(v->s). You still have a struct
value, but it has an invalid pointer.
3) You use the returned string for another op

Basically something like:

expr foo : '\(.+\)' == foo

There are other possible bugs with that scheme too, e.g. involving
coercion. ASAN easily flags the memory leaks and regex tests fail.
With the strategy I sent, it doesn't detect any memory leaks.

(I will send a patch for adding make asantest, make asantest_expr,
etc. which is on top of my test harness fixes.)

-----

Also, while your mind is on expr, and since you mentioned $(()) in
bash ... I happened to look at the mksh source code (Android shell),
and it uses the same algorithm for their $(()). This is in contrast
coreutils and busybox expr which use recursive descent.

https://github.com/MirBSD/mksh/blob/master/expr.c#L404 (loop that
breaks on precedence)

I guess the main difference is that there are unary operators there
like -- and ++, and a ternary ? : operator. I guess it's not much
info but I thought it was interesting! :)

Andy
Post by Andy Chu
Post by Rob Landley
I applied your two patches and I've done one round of cleanup on top of
them so far. Then I started replacing "TT.tok = toys.optargs++" with
directly incrementing TT.tok and I introduced a bug with parentheses
management and got myself confused and went to do something else for a
while and haven't gotten back to it yet.
Yeah it's a little fiddly, but elegant when you get used to the
algorithm's logic. To fix the misleading error message on expr \( \),
try this.
+ if (!strcmp(TT.tok, ")")) { // an atom can't start with )
+ error_exit("Unexpected )");
- if (!strcmp(TT.tok, "(")) { // parenthesized expression
+ } else if (!strcmp(TT.tok, "(")) { // parenthesized expression
The entry into eval_expr() expects an "atom", which is either
something like "1" or "a" or a parenthesized expression like ( 1 + 2
).
Note that expr \* is valid -- it treats it like a literal string. So
expr \) was ALSO valid.
expr \( \) \) was valid -- it is a string inside parentheses. Thus
expr \( \) was being parsed as OPEN PAREN, STRING, and then ERROR,
expected closing \). So it was technically correct, though very
confusing This tiny patch will disallow things like expr \) and expr
\( \) \).
I actually had this check in my original version, but removed it
because of minimalism, but didn't realize that confusing case.
Post by Rob Landley
Post by Andy Chu
I wanted to bring up the one TODO in the patches... syntax_error()
defaults to printing "syntax error", but the caller passes a more
detailed error message, which can be enabled with "if (1)".
This morning's cleanup pass replaced the calls to syntax_error() with
calls to error_exit(), making it always print the more detailed error
message.
OK awesome! Makes sense.
Post by Rob Landley
Post by Andy Chu
That would make it clearer that + is both a unary and binary operator,
while - is only a binary operator.
So my help text is wrong and not _all_ operators are infix.
Sigh...
No your help is correct! I didn't implement what GNU expr does (nor
did existing code). It "puns" + as a unary escaping operator (in
addition to infix addition).
I wrote about this in my message about find / expr / test.
It's because GNU has "match" / "substr" operators, which we don't
$ expr foo = foo
1
$ expr match = match
expr: syntax error # it's expecting arguments to the operator "match"
$ expr + match = + match # escape match operator as string
1
I think the command is OK as is -- it's POSIX conformant... It's
possible we could run into something in Aboriginal LInux that uses the
match / substr operators, in which case we could add them and the
annoying + unary escape.
Andy
Loading...