[Toybox] Does anyone here understand how unicode combining characters work?

Post by Rob Landley
The crunch_str() logic is designed to escape nonprintable stuff, and for watch.c
I need to write something that measures output but lets utf8 combining stuff
happen. (And measures tabs. And also parses at least the color change part of
ansi escapes, but we'll burn that bridge when we come to it...)

I'm using hexdump and echo -e's hex escapes to try to print minimal bits of the
combining character examples (which cut and paste appears to have horked):
$ cat tests/files/utf8/test1.txt
l̴̗̞̠ȩ̸̩̥ṱ̴͍̻ ̴̲͜ͅt̷͇̗̮h̵̥͉̝e̴̡̺̼ ̸̤̜͜ŗ̴͓͉i̶͉͓͎t̷̞̝̻u̶̻̫̗a̴̺͎̯l̴͍͜ͅ ̵̩̲̱c̷̩̟̖o̴̠͍̻m̸͚̬̘ṃ̷̢͜e̵̗͎̫n̸̨̦̖c̷̰̩͎e̴̱̞̗
$ echo -e '\xcc\xb4\xcc\x97\xcc\xa0e'
e
$ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0e'
l̴̗̠e
$ echo -e '\xcc\xb4\xcc\x97\xcc\xa0ee'
ee
$ echo -e 'l\xcc\xb4\xcc\x97\xcc\xa0'
l̴̗̠
$ echo -e '\xcc\xb4\xcc\x97\xcc\xa0'
So there needs to be a character _before_ the combining characters for them to
take effect, but they apply to the character _after_? Even when it's a newline?
(Which still works as a newline, but leaves trailing weirdness?)

But if I have just enough characters to fill a line, the trailing weirdness does
_not_ go to the next line (it appears to get discarded), at least on my 80 char
xfce Terminal:

echo -e 'xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a\xcc\xb4\xcc\x97\xcc\xa0'

I should look up what these escape sequences _do_. Hmmm... I could slowly and
painfully do that by hand, but really I want a sort of unicode version of
"hexdump -C" telling me what the codepoints are. (Ideally combined with a
variant of the "ascii" program to then tell me what each one does.) Somebody has
to have written this already, but I dunno what to Google for. Hmm...

Hey Rich, I'm fiddling with unicode and lost/confused. Know any good tools for this?

Rob

Rich Felker

2018-09-26 19:01:58 UTC

Combining characters (at the terminal, any wcwidth==0 characters since
there is no finer-grained distinction) attach to the
previous/logical-left character cell.

Post by Rob Landley
But if I have just enough characters to fill a line, the trailing weirdness does
_not_ go to the next line (it appears to get discarded), at least on my 80 char
echo -e 'xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a\xcc\xb4\xcc\x97\xcc\xa0'

What you should see is:

xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a̴̗̠

That is, the combining characters should be visible on the 'a' in the
last cell. I would not be surprised if some terminals get this wrong.

Post by Rob Landley
I should look up what these escape sequences _do_. Hmmm... I could slowly and
painfully do that by hand, but really I want a sort of unicode version of
"hexdump -C" telling me what the codepoints are. (Ideally combined with a
variant of the "ascii" program to then tell me what each one does.) Somebody has
to have written this already, but I dunno what to Google for. Hmm...
Hey Rich, I'm fiddling with unicode and lost/confused. Know any good tools for this?

enh

2018-09-26 19:21:46 UTC

in general ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt is
pretty useful too. iirc plan9 had a code point lookup tool, but
honestly i mainly type U+xxxx into Google and end up at
https://www.fileformat.info/info/unicode/char/2028/index.htm.

the wcwidth stuff isn't well defined (in that it's not a Unicode
notion, and is under-specified by POSIX) but Unicode does have the
"east asian width" data. see
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt for that.

the Unicode FAQs are often helpful too.
http://unicode.org/faq/char_combmark.html

plus the full standard is freely available:
http://www.unicode.org/versions/Unicode11.0.0/

Post by Rich Felker
Combining characters (at the terminal, any wcwidth==0 characters since
there is no finer-grained distinction) attach to the
previous/logical-left character cell.

xxxxxxxxxxxxxxxxxx0123456789091234567890123456789012345678901234567890123456789a̴̗̠
That is, the combining characters should be visible on the 'a' in the
last cell. I would not be surprised if some terminals get this wrong.

Does something like this help?
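#define _XOPEN_SOURCE 700 /* exposes wcwidth() on glibc */
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void)
{
    wint_t c;

    /* Use the environment's locale so getwchar() decodes UTF-8 input. */
    setlocale(LC_CTYPE, "");
    /* Print each code point with the column width the C library reports. */
    while ((c = getwchar()) != WEOF)
        printf("U+%.4X wcwidth=%d\n", (unsigned)c, wcwidth(c));
    return 0;
}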
Rich
_______________________________________________
Toybox mailing list
http://lists.landley.net/listinfo.cgi/toybox-landley.net

Rich Felker

2018-09-26 19:39:06 UTC

Post by enh
in general ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt is
pretty useful too. iirc plan9 had a code point lookup tool, but
honestly i mainly type U+xxxx into Google and end up at
https://www.fileformat.info/info/unicode/char/2028/index.htm.
the wcwidth stuff isn't well defined (in that it's not a Unicode
notion, and is under-specified by POSIX) but Unicode does have the

This is true; it's only defined by convention between implementations
and terminal emulators, and without their agreement, everything
breaks.

Post by enh
"east asian width" data. see
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt for that.
the Unicode FAQs are often helpful too.
http://unicode.org/faq/char_combmark.html
http://www.unicode.org/versions/Unicode11.0.0/

Generally, implementations agree that characters with East Asian Width
property full or wide are wcwidth==2, and character classes Mn or Me
(nonspacing or enclosing combining) are wcwidth==0. There are also a
number of class Cf characters that need to be treated as wcwidth==0
for the associated languages to work on a terminal.

Rich

enh

2018-09-26 20:00:06 UTC

if anyone's interested, here's how bionic translates from the actual
unicode properties to implement wcwidth:
https://android.googlesource.com/platform/bionic/+/master/libc/bionic/wcwidth.cpp

(we do this in general so that we can outsource all the actual
unicode data to icu4c, and thereby guarantee consistency for
C/C++/Java regardless of which API is actually called.)

Rob Landley

2018-09-26 20:42:25 UTC

Post by enh
if anyone's interested, here's how bionic translates from the actual
https://android.googlesource.com/platform/bionic/+/master/libc/bionic/wcwidth.cpp
(we do this in general so that we can outsource all the actual
unicode data to icu4c, and thereby guarantee consistency for
C/C++/Java regardless of which API is actually called.)

I think I've got the answer to my question now. What I needed to know was how
much I can print before the cursor winds up on the next line (and scrolls the
screen if it was at the bottom), and the answer is "print combining characters
_after_ the last character, but stop before the next wcwidth>0 character that
would overflow the line".

(This is the logic I've needed to work out for screen, less, and vi as well. At
least when they're not doing the force escapes thing.)

The ansi escape parsing is still a todo item, but I note I wrote my own ansi
escape parsing direct screen memory writer for DOS as one of my first C programs
back in 1990. :P

(And tabs. And the other low-ascii stuff that's also handled inconsistently and
which I might have watch and less and such filter out and just not print to the
tty. It'd be nice if TERM=linux specified consistent behavior here, but it's
determined by the terminal display program consuming the output...)

Thanks,

Rob

Rich Felker

2018-09-26 22:28:30 UTC

On a decent terminal (google "magic margins"), you can always print
the full width of the terminal, even on the last line, so if the
terminal width is 80, you print until the wcwidth of the next
character would throw the position strictly over 80 (81 or higher).

I'm not sure if there are still any non-magic-margin terminals that
are relevant. If so, and if you don't know what row you're on (e.g.
for shell line editing), you probably just need to stop at 1 column
less than the width to be safe. You could probably hardcode a list of
$TERM values for broken terminals though.

Post by Rob Landley
(This is the logic I've needed to work out for screen, less, and vi as well. At
least when they're not doing the force escapes thing.)
The ansi escape parsing is still a todo item, but I note I wrote my own ansi
escape parsing direct screen memory writer for DOS as one of my first C programs
back in 1990. :P
(And tabs. And the other low-ascii stuff that's also handled inconsistently and
which I might have watch and less and such filter out and just not print to the
tty. It'd be nice if TERM=linux specified consistent behavior here, but it's
determined by the terminal display program consuming the output...)

I think most of this stuff is largely Unicode-agnostic, and is just a
matter of understanding classic terminal behavior and the idioms for
dealing with it.

Rich

Rob Landley

2018-09-27 13:53:07 UTC

Post by Rob Landley
I think I've got the answer to my question now. What I needed to know was how
much I can print before the cursor winds up on the next line (and scrolls the
screen if it was at the bottom), and the answer is "print combining characters
_after_ the last character, but stop before the next wcwidth>0 character that
would overflow the line".

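Post by Rich Felker
I'm not sure if there are still any non-magic-margin terminals that
are relevant.

I haven't encountered any, and that's how top works. Nobody's complained yet.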

Post by Rich Felker
If so, and if you don't know what row you're on (e.g.
for shell line editing), you probably just need to stop at 1 column
less than the width to be safe. You could probably hardcode a list of
$TERM values for broken terminals though.

It's not $TERM, it's the xterm consuming the output making that decision. $TERM
largely boils down to which ANSI escapes to produce behind the scenes. I don't
think your xterm can even read its child process's environment variables. (Well,
I suppose it could through /proc/$PID/environ but I'm unaware of any of them
doing it...)

The whole $TERM nonsense is legacy of physical teletype machines, then "glass
tty" terminals (VT100, TN3270, etc) that emulated them and added bespoke
per-vendor escape sequences. The IBM PC text mode swept the field (to the point
I had an amiga terminal that emulated it for bulletin boards), but "this code
was written and works so nobody's going to throw it out" kept bad legacy
assumptions alive for decades longer than they made any sense.

Post by Rob Landley
(And tabs. And the other low-ascii stuff that's also handled inconsistently and
which I might have watch and less and such filter out and just not print to the
tty. It'd be nice if TERM=linux specified consistent behavior here, but it's
determined by the terminal display program consuming the output...)

Post by Rich Felker
I think most of this stuff is largely Unicode-agnostic, and is just a
matter of understanding classic terminal behavior and the idioms for
dealing with it.

The low-ascii stuff is not related to unicode, yes. But it got swept up in the
unicode changes and behavior changed when unicode support went in. And
unfortunately, terminal programs differ and the Linux ctrl-alt-f1 text mode
terminals differ from the xterms. (Haven't tried a frame buffer yet...)

For example, when I do echo -e '\x02\x02\x03\x04x' on xfce xterm, I get 4 square
boxes with digits in (I.E. uni-codepoint has no glyph, doo dah, doo dah)
followed by x. But ctrl-alt-f1 text mode prints nothing and does not advance the
cursor either, I just get the x on the first column. (I even tried "export
TERM=linux" in both and it didn't change the behavior, that's orthogonal.)

Hence filtering some of them out and not printing them if I dunno whether
they'll advance the cursor or not.

Going down ratholes most people never noticed the existence of, as usual.

(You wrote your own xterm, what does _it_ do here?)

Rob

Rob Landley

2018-09-27 14:10:04 UTC

Post by Rob Landley
The low-ascii stuff is not related to unicode, yes. But it got swept up in the
unicode changes and behavior changed when unicode support went in. And
unfortunately, terminal programs differ and the Linux ctrl-alt-f1 text mode
terminals differ from the xterms. (Haven't tried a frame buffer yet...)
For example, when I do echo -e '\x02\x02\x03\x04x' on xfce xterm, I get 4 square
boxes with digits in (I.E. uni-codepoint has no glyph, doo dah, doo dah)
followed by x. But ctrl-alt-f1 text mode prints nothing and does not advance the
cursor either, I just get the x on the first column. (I even tried "export
TERM=linux" in both and it didn't change the behavior, that's orthogonal.)
Hence filtering some of them out and not printing them if I dunno whether
they'll advance the cursor or not.

P.S. I've got this commented-out note to self in my local tests/ls.test:

echo -e "$(X=0;while [ $X -lt 255 ];do X=$(($X+1));[ $X -eq 47 ]&&
continue;printf '\\x%02x' $X; done)"

Which I think was meant to create a torture test for ls -b display mode? Ala
touch "$(that)" in an empty directory and ls -b it.

That says on this xterm, outputting ascii 0 doesn't display, 1-4 are boxes, 5 is
ignored, 6 is a box, 7-f aren't boxes but there are a couple of line breaks in
there (\b, \t, \r, and \n live in that range, then 0x10 through 1f are boxes again).

Meanwhile, in Linux text mode the first non-space character printed is ! and if
I add an 'x' after the character printed each time it's:

xxxxxxx x
x
x
x|xxxxxxxxxxxxxxx x!x[and so on]

(Which is confused by \b and \r taking effect, but why is there a pipe after
ascii 16???)

Post by Rob Landley
Going down ratholes most people never noticed the existence of, as usual.

Continuing down said rathole...

(I'm pretty sure "faking the linux VGA text mode behavior for low ascii
characters" is as close to "a standard" as we're likely to get here.)

Rob

enh

2018-09-27 20:34:46 UTC

Post by Rob Landley
echo -e "$(X=0;while [ $X -lt 255 ];do X=$(($X+1));[ $X -eq 47 ]&&
continue;printf '\\x%02x' $X; done)"
Which I think was meant to create a torture test for ls -b display mode? Ala
touch "$(that)" in an empty directory and ls -b it.
That says on this xterm, outputting ascii 0 doesn't display,

having written several terminal emulators (including the one i still
use every day), if you do show something for NUL you find that a
surprising number of C programs have an off-by-one that causes them to
accidentally output the NUL terminator too.

Post by Rob Landley
1-4 are boxes, 5 is
ignored, 6 is a box, 7-f aren't boxes but there are a couple of line breaks in
there (\b, \t, \r, and \n live in that range, then 0x10 through 1f are boxes again).

http://spinroot.com/pico/pjw.html (search for "Plan 9").

Rob Landley

2018-09-26 20:59:16 UTC

Post by Rich Felker
Combining characters (at the terminal, any wcwidth==0 characters since
there is no finer-grained distinction) attach to the
previous/logical-left character cell.

The xfce terminal shows all the data on the character to the right.

Thunderbird sticks ~ characters in between stuff, but shows the sub-whatsis
(cedilla?) under the character to the right.

I pulled up the web archive in chrome on a windows box at work and it's... sort
of doing both? The second example in the list ("le") is showing the
under-apostrophe under the l but has some sort of overstrike through the E, and
the next to last one has l with an under-apostrophe but then a tilde after it.

Ahem: Wheee.