[Toybox] [PATCH] POSIX's unexpand command
Mouse
mouse at Rodents-Montreal.ORG
Fri Feb 23 19:14:25 PST 2024
> unexpand "converts spaces to tabs".
> This command's behavior is so simple (s/ /\t/g) that it can be
> knocked out in a couple hours,
Well...sort of. unexpand without -a can be, sure. With -a, it's more
complicated, unless you are willing to assume things like "no multibyte
characters" or "all non-ASCII text is Shift-JIS".
> Since the command only looks for 2 characters (' ' and '\t'), no UTF
> safety checking is required,
Safety? If you want to support multibyte characters of any sort with
-a, you need to parse them enough to determine how many bytes make up
each character, because that affects how many spaces to eat to convert
to a tab. (Without -a, this is not an issue.)
For example, if you get a line containing, in hex,
    d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40
then (assuming 8-character tabstops and -a in effect) under
8859-1 you have (to use Unicode names) LATIN CAPITAL LETTER ETH and
DEGREE SIGN, with the pair repeated three times, and you thus convert
the first _two_ of the spaces to a tab, but under UTF-8 you have three
instances of CYRILLIC SMALL LETTER A and you thus convert the first
_five_ of the spaces to a tab. (Handling tabs in the input makes it
even more complicated.)
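To make the arithmetic concrete, here is a rough sh sketch of the column counting for that example. The function names (count_columns, spaces_to_next_tab) are made up for illustration, it takes the bytes as hex arguments rather than reading real input, and it assumes every character is exactly one column wide and that the UTF-8 is well-formed. It is not how any real unexpand does it.

```sh
#!/bin/sh
# Sketch only: assumes every character occupies exactly one column
# and that any UTF-8 input is well-formed.

# count_columns ENCODING HEXBYTE...
# Print the column reached after the given bytes (hypothetical helper).
count_columns() {
    enc=$1
    shift
    col=0
    for b in "$@"; do
        v=$((0x$b))
        # In UTF-8, bytes 0x80..0xBF continue the previous character,
        # so they do not advance the column.
        if [ "$enc" = utf8 ] && [ "$v" -ge 128 ] && [ "$v" -le 191 ]; then
            continue
        fi
        col=$((col + 1))
    done
    echo "$col"
}

# spaces_to_next_tab COLUMN
# Print how many spaces lie between COLUMN and the next 8-column tabstop.
spaces_to_next_tab() {
    echo $((8 - $1 % 8))
}

# The six leading bytes from the example line, under both interpretations.
for enc in latin1 utf8; do
    col=$(count_columns "$enc" d0 b0 d0 b0 d0 b0)
    echo "$enc: text ends at column $col, tab replaces $(spaces_to_next_tab "$col") spaces"
done
```

This prints column 6 and 2 spaces for latin1, and column 3 and 5 spaces for utf8, matching the counts above. A real implementation would also have to deal with tabs in the input, wide and zero-width characters, and invalid byte sequences.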
> The GNU man page doesn't say if spaces are supposed to be processed
> beyond the beginning of lines.
The GNU man page is relevant to only the GNU version. I would not use
it as a reference for anything else, least of all what the command
should do in the abstract. (That said, I would have hoped they would
document their software more precisely, such as saying what happens to
non-initial whitespace in the absence of -a.)
A non-GNU (NetBSD) manpage I have handy says
     -a      By default, only leading blanks and tabs are reconverted to maximal
             strings of tabs.  If the -a option is given, then tabs are inserted
             whenever they would compress the resultant file by replacing two or
             more characters.
which is, at least, clearer. (That version has nothing like GNU's
--first-only, or at least the manpage doesn't.)
> [...], and the "--first-only" option serves the same purpose as grep
> -G (None at all, [...])
Actually, it does; it can be specified to get the default behaviour
when the opposing option might have been specified already. For
example, if I have a wrapper script (let's call it "unex")
    #! /bin/sh
    # "--" stops set from treating a leading -a in $UNEX_OPTIONS as its own option.
    set -- $UNEX_OPTIONS "$@"
    unexpand "$@"
then I can run "unex --first-only" to get the default behaviour
regardless of whether -a is present in $UNEX_OPTIONS.
/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse at rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B