[Toybox] [PATCH] POSIX's unexpand command
Oliver Webb
aquahobbyist at proton.me
Fri Feb 23 22:55:45 PST 2024
On Friday, February 23rd, 2024 at 21:14, Mouse <mouse at Rodents-Montreal.ORG> wrote:
> > unexpand "converts spaces to tabs".
>
> > This command's behavior is so simple (s/ /\t/g) that it can be
> > knocked out in a couple hours,
>
> Well...sort of. unexpand without -a can be, sure. With -a, it's more
> complicated, unless you are willing to assume things like "no multibyte
> characters" or "all non-ASCII text is Shift-JIS".
>
> > Since the command only looks for 2 characters (' ' and '\t'), no UTF
> > safety checking is required,
>
> Safety? If you want to support multibyte characters of any sort with
> -a, you need to parse them enough to determine how many bytes make up
> each character, because that affects how many spaces to eat to convert
> to a tab. (Without -a, this is not an issue.)
>
> For example, if you get a line containing, in hex,
>
> d0 b0 d0 b0 d0 b0 20 20 20 20 20 20 20 20 40
>
> (assuming 8-character tabstops and -a in effect), then under
> 8859-1 you have (to use Unicode names) LATIN CAPITAL LETTER ETH and
> DEGREE SIGN, with the pair repeated three times, and you thus convert
> the first two of the spaces to a tab, but under UTF-8 you have three
> instances of CYRILLIC SMALL LETTER A and you thus convert the first
> five of the spaces to a tab. (Handling tabs in the input makes it
> even more complicated.)
From the NetBSD manpage you quote later:
"If the -a option is given, then tabs are inserted
whenever they would compress the resultant file by replacing two or
more characters."
Correct me if I'm wrong, but I don't see how UTF-8 has anything to do with this. It takes a string of spaces and
replaces it with length/tabwidth tabs, then length%tabwidth spaces. POSIX says this too:
"translate all sequences of two or more <blank> characters immediately preceding a tab stop
to the maximum number of <tab> characters followed by the minimum number of <space> characters
needed to fill the same column positions originally filled by the translated <blank> characters."
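For reference, here is a minimal sketch of how I read that POSIX wording, since it ties the conversion to tab stops and so has to track the current column. unexpand_line() is a hypothetical helper, not toybox code, and it assumes every non-tab byte is exactly one column wide (no multibyte characters, no backspace handling):

```c
#include <stddef.h>

/* Hypothetical helper (not toybox code): "unexpand -a" style conversion of
 * one NUL-terminated line. Assumes every non-tab byte is one column wide.
 * out needs room for at least strlen(in) + 1 bytes. Returns the number of
 * bytes written, not counting the terminating NUL. */
size_t unexpand_line(const char *in, char *out, int tabwidth)
{
  size_t o = 0;
  int col = 0, run = -1; /* run: column where the pending blank run began */

  for (;; in++) {
    if (*in == ' ' || *in == '\t') {
      if (run < 0) run = col;
      if (*in == ' ') col++;
      else col += tabwidth - col % tabwidth; /* tab jumps to next tab stop */
      continue;
    }
    /* Non-blank or end of line: re-emit the blanks covering [run, col). */
    while (run >= 0 && run < col) {
      int next = run + tabwidth - run % tabwidth;

      /* Use a tab only if it reaches a tab stop and replaces 2+ columns. */
      if (next <= col && next - run >= 2) {
        out[o++] = '\t';
        run = next;
      } else {
        out[o++] = ' ';
        run++;
      }
    }
    run = -1;
    if (!*in) break;
    out[o++] = *in;
    col++;
  }
  out[o] = 0;

  return o;
}
```

With tabwidth 8, six one-column characters followed by eight spaces come out as the six characters, one tab, and six spaces, because only the first two spaces sit immediately before a tab stop.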
Sigh. Skimming over lib/utf8.c, and assuming utf8len() is like strlen() but for UTF-8,
that might make things a bit easier. I was hoping to never have to touch UTF-8 while writing
this.
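To make the column question concrete, here is a rough sketch of counting characters rather than bytes. utf8_columns() is a hypothetical stand-in, not the actual lib/utf8.c function, and it assumes valid UTF-8 where every character is one column wide (which is not true of wide or combining characters):

```c
#include <stddef.h>

/* Hypothetical helper (not lib/utf8.c): count characters in a UTF-8 buffer
 * by skipping continuation bytes, which all look like 10xxxxxx. Assumes the
 * input is valid UTF-8 and that each character occupies one column. */
size_t utf8_columns(const char *s, size_t len)
{
  size_t cols = 0, i;

  for (i = 0; i < len; i++)
    if (((unsigned char)s[i] & 0xC0) != 0x80) cols++;

  return cols;
}
```

For the bytes d0 b0 d0 b0 d0 b0 from your example this counts 3 columns, not 6, so five spaces are needed to reach the tab stop at column 8, not two.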
> > The GNU man page doesn't say if spaces are supposed to be processed
> > beyond the beginning of lines.
>
>
> The GNU man page is relevant to only the GNU version.
It's not relevant to _any_ version because it does not document the behavior
of any implementation, not even its own. It fails to document user-visible
things such as the actual behavior of -a. Saying
"convert all blanks, instead of just initial blanks" and NOTHING else about the behavior
of -a is misleading.
> I would not use
> it as a reference for anything else, least of all what the command
> should do in the abstract. (That said, I would have hoped they would
> document their software more precisely, such as saying what happens to
> non-initial whitespace in the absence of -a.)
>
> A non-GNU (NetBSD) manpage I have handy says
>
> -a By default, only leading blanks and tabs are reconverted to maximal
> strings of tabs. If the -a option is given, then tabs are inserted
> whenever they would compress the resultant file by replacing two or
> more characters.
>
> which is, at least, clearer. (That version has nothing like GNU's
> --first-only, or at least the manpage doesn't.)
>
> > [...], and the "--first-only" option serves the same purpose as grep
> > -G (None at all, [...])
>
>
> Actually, it does; it can be specified to get the default behaviour
> when the opposing option might have been specified already. For
> example, if I have a wrapper script (let's call it "unex")
>
> #! /bin/sh
> set -- $UNEX_OPTIONS "$@"
> unexpand "$@"
>
> then I can run "unex --first-only" to get the default behaviour
> regardless of whether -a is present in $UNEX_OPTIONS.
- Oliver Webb <aquahobbyist at proton.me>