Discussion:
[PATCH] Fix wc -m on bionic.
(too old to reply)
enh
2017-08-05 00:54:19 UTC
Permalink
Raw Message
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.

With these changes, the tests still pass on glibc and also pass on bionic.
---
tests/wc.test | 7 +------
toys/posix/wc.c | 34 ++++++++++++++--------------------
2 files changed, 15 insertions(+), 26 deletions(-)
Rob Landley
2017-08-07 01:38:50 UTC
Permalink
Raw Message
Post by enh
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.
I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.

What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)

I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)

Meanwhile, the code you just sent me is not clearing the magic internal
state, and it's user-visible:

$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
Post by enh
With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...

Rob
enh
2017-08-08 22:21:29 UTC
Permalink
Raw Message
(sorry, didn't get time to look at this until now.)
Post by Rob Landley
Post by enh
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.
I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.
What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)
I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)
yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.
Post by Rob Landley
Meanwhile, the code you just sent me is not clearing the magic internal
$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
the attached updated patch fixes that and adds the test.
Post by Rob Landley
Post by enh
With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
enh
2017-08-22 21:05:37 UTC
Permalink
Raw Message
happy with the replacement patch?
Post by enh
(sorry, didn't get time to look at this until now.)
Post by Rob Landley
Post by enh
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.
I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.
What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)
I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)
yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.
Post by Rob Landley
Meanwhile, the code you just sent me is not clearing the magic internal
$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
the attached updated patch fixes that and adds the test.
Post by Rob Landley
Post by enh
With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2017-08-23 07:20:25 UTC
Permalink
Raw Message
Post by enh
happy with the replacement patch?
Sorry, $DAYJOB is eating all my energy these days
(http://lists.j-core.org/pipermail/j-core/2017-August/000646.html), and
thunderbird loses open reply windows on a reboot (my netbook battery
will die _while_plugged_in_ if you leave a USB thing drawing power from
a suspended system; it stops charging once the battery fills up and then
the USB drains it again, then it recharges the battery once it's fully
"off")...

Where was I on this one...
Post by enh
Post by enh
Post by Rob Landley
I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)
yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.
Before you sent me the new patch I wrote my own "grab the next utf8 char
without a state structure or caring about environment variables"
function (it wasn't hard and it's something I've been thinking about to
try to speed up "top" anyway; top level can be an inline function with
the "if (*s<128) {*len = 1; return *s}" fast path and then it can call
the real function for the rest), but I'm really uncomfortable replacing
a libc function like this so I wanted to compare mine with musl's
implementation _and_ do utf8 torture tests for all the weird corner
cases, so I needed to do a toys/example/test_utf8.c and appropriate
tests/test_utf8.test, and then you sent an updated patch and I went
"must compare/collate" but didn't get back to it...
Post by enh
Post by enh
Post by Rob Landley
Meanwhile, the code you just sent me is not clearing the magic internal
$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
the attached updated patch fixes that and adds the test.
On the bright side, I'm mostly caught up on $DAYJOB stuff from when I
was out sick for a few days. :)

Not remotely caught up with toybox yet, and the past few times I poked
at it was trying to get the next round of lsof.c cleanup ready (which is
relaxing because nobody's waiting for it)...

Gimme one more day to try to finish my utf8 replacement code, and if I
haven't I'll merge your patch as-is. Sorry for the delay,

Rob

Loading...