[Toybox] [PATCH] POSIX's unexpand command

Mouse mouse at Rodents-Montreal.ORG
Sat Feb 24 04:26:02 PST 2024


>> For example, if you get a line containing, in hex,
>> 
>> d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40
>> 
>> then (assuming 8-character tabstops and -a in effect), under
>> 8859-1 you have (to use Unicode names) LATIN CAPITAL LETTER ETH and
>> DEGREE SIGN, with the pair repeated three times, and you thus
>> convert the first two of the spaces to a tab, but under UTF-8 you
>> have three instances of CYRILLIC SMALL LETTER A and you thus convert
>> the first five of the spaces to a tab.  (Handling tabs in the input
>> makes it even more complicated.)

> From the NetBSD Manpage you quote later:
> "If the -a option is given, then tabs are inserted whenever they
> would compress the resultant file by replacing two or more
> characters."

> Correct me if I'm wrong, but I don't see how utf8 has anything to do
> with this?

unexpand, as described by that manpage, operates on characters, not
bytes.  Thus, questions such as "UTF-8 or 8859-1 or what?" are
relevant, because they affect whether (for example) the first two
octets of the line I gave constitute two characters, one character,
part of one character, or what.
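
As a rough illustration (a sketch in Python, not anything from the
patch), decoding the same octets under each encoding shows how the
character count before the spaces changes.  The names LINE and
describe_prefix are made up for this example.

```python
import unicodedata

# The example line from this thread, as raw octets.
LINE = bytes.fromhex("d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40")

def describe_prefix(octets, encoding):
    """Return the Unicode names of the characters before the first space.

    Hypothetical helper for illustration only.
    """
    text = octets.decode(encoding)
    prefix = text[:text.index(" ")]
    return [unicodedata.name(ch) for ch in prefix]

if __name__ == "__main__":
    for enc in ("iso-8859-1", "utf-8"):
        names = describe_prefix(LINE, enc)
        # Print the encoding, how many characters precede the spaces,
        # and the distinct character names seen.
        print(enc, len(names), sorted(set(names)))
```

The same six octets come out as six characters under 8859-1 and as
three under UTF-8, which is the whole difference.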

Interpreted as 8859-1, the string of octets I gave starts with six
nonblank characters, so, under the assumptions I described, the first
two spaces should be converted to a tab.  Under UTF-8, there are only
three characters - represented by six octets - before the string of
spaces, so the first five spaces should be replaced.  (I'd cite other
examples, but UTF-8 is the only multibyte character encoding I know
well enough to give an example of.)
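
To make the arithmetic concrete, here is a minimal sketch in Python of
a character-based "unexpand -a" for a single line.  It is not the
toybox or NetBSD code, and it makes simplifying assumptions: every
character occupies exactly one column (no wide or combining
characters), the input contains no tabs or newlines, and a run of
spaces is replaced only when it ends at a tabstop and is at least two
characters long.  The names unexpand_all and unexpand_octets are
invented for the example.

```python
# The example line from this thread, as raw octets.
LINE = bytes.fromhex("d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40")

def unexpand_all(text, tabstop=8):
    """Replace runs of 2+ spaces that end at a tabstop with a tab.

    Sketch only: assumes one column per character and no tabs or
    newlines in the input.
    """
    out = []
    pending = 0   # spaces seen but not yet emitted
    col = 0       # column reached so far, counted in characters
    for ch in text:
        col += 1
        if ch == " ":
            pending += 1
            if col % tabstop == 0:
                # The run of spaces has reached a tabstop.
                out.append("\t" if pending >= 2 else " " * pending)
                pending = 0
        else:
            # A non-space ends the run; emit the spaces unchanged.
            out.append(" " * pending)
            pending = 0
            out.append(ch)
    out.append(" " * pending)
    return "".join(out)

def unexpand_octets(octets, encoding, tabstop=8):
    """Decode, unexpand by characters, and re-encode."""
    return unexpand_all(octets.decode(encoding), tabstop).encode(encoding)

if __name__ == "__main__":
    for enc in ("iso-8859-1", "utf-8"):
        print(enc, unexpand_octets(LINE, enc).hex(" "))
```

Under 8859-1 this turns the first two spaces into a tab and leaves six
spaces; under UTF-8 it turns the first five into a tab and leaves
three.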

> Was hoping to never have to touch utf8 while writing this.

Don't blame you.  I think UTF-8 is a major botch (variable-sized
character representations? seriously??) and have as little to do with
it as I can manage.

Unfortunately, fixing it right requires converting "everything" from
streams of octets to streams of Unicode codepoints, which is a lot of
work.  (I want to do it, someday, but it will be a major undertaking.)

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse at rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

