Discussion:
[PATCH] Fix wc -m on bionic.
Add Reply
enh
2017-08-05 00:54:19 UTC
Reply
Permalink
Raw Message
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.

With these changes, the tests still pass on glibc and also pass on bionic.
---
tests/wc.test | 7 +------
toys/posix/wc.c | 34 ++++++++++++++--------------------
2 files changed, 15 insertions(+), 26 deletions(-)
Rob Landley
2017-08-07 01:38:50 UTC
Reply
Permalink
Raw Message
Post by enh
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.
I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.

What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)

I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)

Meanwhile, the code you just sent me is not clearing the magic internal
state, and it's user-visible:

$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
Post by enh
With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...

Rob
enh
2017-08-08 22:21:29 UTC
Reply
Permalink
Raw Message
(sorry, didn't get time to look at this until now.)
Post by Rob Landley
Post by enh
When mbrtowc returns -2, all n bytes have been processed. Bionic's
interpretation of POSIX is that you must not re-supply those bytes
on the next call, and should only supply the bytes needed to complete
the character. If you re-supply the bytes on the next call, bionic
considers that an illegal sequence and returns -1.
I've had a headache for 3 days and it's really hard for me to make good
technical decisions right now, because I want to set fire to everything
for being pointlessly overcomplicated.
What I really wanted was a function that didn't maintain magic internal
state between calls, but would just return if it couldn't do what I
asked and let me handle it myself or try again with more data. (When I'm
escaping invalid chars that's the semantics I need, and why have two?)
I don't know how to make modern libc do the simple thing, it insists on
maintaining state between calls even when you pass in a NULL for the
state, because reasons. It sounds like if I want a simple converter, I
need to write my in lib.c, or maybe wrap this one with a flush after
every error. (Sending it a zero byte should flush? I think?)
yeah, this is yet another ugly corner of POSIX. you can get what you
want, though it's pretty unguessable. the function mbsinit is a
_predicate_ to tell you whether the mbstate_t is in the initial state
or not. you're supposed to memset or moral equivalent to reset to the
initial state. afaik there's no way to reset the global state. i can't
think i've ever seen one.
Post by Rob Landley
Meanwhile, the code you just sent me is not clearing the magic internal
$ echo -ne '\xf0\xbf\xbf\xbf' | wc -m
1
$ echo -ne '\xf0\xbf' > one
$ echo -ne '\xbf\xbf' > two
$ wc -m one two
0 one
0 two
0 total
$ ./toybox wc -m one two
0 one
1 two
1 total
the attached updated patch fixes that and adds the test.
Post by Rob Landley
Post by enh
With these changes, the tests still pass on glibc and also pass on bionic.
I should test musl. After I go lie down again...
Rob
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Loading...