[Toybox] utf8 display question.

scsijon scsijon at lamiaworks.com.au
Thu Oct 26 21:31:14 PDT 2017


Not just Japanese: most kanji are double-width, and some abjads (think 
Arabic for simplicity) and a few odd others also use a mix of single- and 
double-width characters. There are also a few that mix half-width and 
single-width, and some even have triple-width to contend with.

https://msdn.microsoft.com/en-us/library/cc194788.aspx

will give you some general idea of the basic formats for kana/kanji, but 
really I can only say, good luck with it all.
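
To make the mix of widths concrete, here is a rough Python sketch of
per-character width classification. It assumes the Unicode East Asian
Width property is a good enough stand-in for what a terminal does (wide
and fullwidth count as 2 columns, combining marks as 0, everything else
as 1). display_width is a made-up helper name, not anything in toybox.

```python
import unicodedata

def display_width(ch):
    """Approximate terminal columns for one character.

    Assumption: East Asian Width classes W (wide) and F (fullwidth) take
    2 columns, combining marks take 0, and everything else takes 1.
    """
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

# ASCII letter, kanji, half-width katakana KA, full-width katakana KA
samples = "a\u65e5\uff76\u30ab"
widths = [display_width(c) for c in samples]
```

Note the half-width katakana comes out as a single column while its
full-width twin takes two, so the same script can mix widths in one line.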

> Date: Wed, 25 Oct 2017 18:42:14 -0500
> From: Rob Landley <rob at landley.net>
> To: toybox at lists.landley.net
> Subject: [Toybox] utf8 display question.
> Message-ID: <47ce57b4-486b-2920-5358-92ee955f467b at landley.net>
> Content-Type: text/plain; charset=utf-8
>
> I'm adding cut -C to do column-based selection, what should it do about
> the middle of double width characters? Right now I'm having it round
> down, so since Japanese text is double width in monospaced fonts:
>
> $ cat tests/files/utf8/japan.txt && echo
> ?????????????????????????
> $ ./cut -C 5-11 tests/files/utf8/japan.txt
> ???
>
> I.E. 5 skips the first 2 (which starts at column 4, the next display
> point _below_ 5), and then it continues to stop before the ending
> column. (So 5-11 is the same as 5-10, and 5-12 shows 4 characters
> because the 4th character includes column 12).
>
> This is consistent, but I'm not sure if it's right...? Should the first
> one round up instead? (Since it's an exclusion range, should the start
> fail forward and the end fail backwards?)
>
> Dunno...
>
> Rob
>
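
One way to read the rounding rule in the quoted message, as a Python
sketch rather than the actual toybox code: with 1-based inclusive
columns, a character is kept when its last column is at or after the
start column (so a start landing mid-character rounds down to include
it) and at or before the end column (so a character straddling the end
is dropped). cut_columns and char_width are hypothetical names, the
sample string is made up and is not the contents of japan.txt, and only
widths 1 and 2 are handled.

```python
import unicodedata

def char_width(ch):
    # Assumption: wide/fullwidth characters take 2 columns, all others 1.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def cut_columns(text, start, end):
    """Select 1-based display columns start..end (inclusive)."""
    out = []
    col = 1  # first column the next character will occupy
    for ch in text:
        last = col + char_width(ch) - 1  # last column this character occupies
        # Keep it if it reaches the start column and ends by the end column.
        if start <= last <= end:
            out.append(ch)
        col = last + 1
    return "".join(out)

sample = "日本語のテキスト例"  # made-up sample, every character is double width
picked = cut_columns(sample, 5, 11)
```

With this reading 5-11 and 5-10 select the same three characters, and
5-12 picks up a fourth, which matches the behaviour described in the
quoted message.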


