[Toybox] [PATCH] tr.c: added -t option and cleanup up formatting
Rob Landley
rob at landley.net
Sat Oct 21 03:04:49 PDT 2023
On 10/20/23 23:22, Oliver Webb via Toybox wrote:
> Heya, I noticed that tr was in pending, taking a look at the source code.
Yeah, it's one of the big remaining todo items to get Linux From Scratch
building. I was looking at it briefly last week...
> It doesn't look very unclean, nor does it fail any test cases.
I have a redesign to make it handle utf-8 encoded unicode, both in the input and
in the patterns. Took me forever to work out how, but I _think_ I understand it
now? Just haven't done it yet.
Well, I think I've figured out how to handle unicode (with combining characters)
and the [:class:] specifiers. Still don't understand what [=CHAR=] equivalence
classes mean, exactly, other than "strip combining characters"? Except there's a
lot of À Á Â Ã in the base set that... the man page says that equivalence
classes are defined by LC_COLLATE, but everybody seems to punt on the specifics.
(Or maybe this is just a symptom of Google having a harder time finding stuff
these days? Section 3.1.3.6 of http://unicode.org/L2/L2001/01487-14652w25.pdf is
not very illuminating.)
Anyway, hadn't dug into that part yet. Vaguely planning to punt and wait for a
complaint, because the OTHER thing that comes up a lot when you search for this
is "it doesn't work". Although I am highly amused by the database error at:
https://www.unix.com/shell-programming-and-scripting/283373-equivalence-classes-dont-work.html
Which is saying that the page talking about how equivalence classes don't work
itself does not work.
This guy went into detail, but I have not opened that particular can of worms yet:
http://databasearchitects.blogspot.com/2016/08/equivalence-of-unicode-strings-is.html
> The only 2 things
> in the TODO are -t and -a. Neither POSIX nor GNU tr specifies a -a[scii] option.
> The name gives a general idea of what it's supposed to do
> (Stop acting utf-8 safe and treat everything as extended ASCII?)
It's a note-to-self that there should probably be a way to disable the unicode
support I haven't added yet, and that -a isn't currently used anywhere I could find.
> I added in a -t[runcate] option and a corresponding test case.
>
> I also cleaned up some of the code (foobar[0] to *foobar, removing sizeof(char), etc)
Applied, and I did a little more cleanup while I was there.
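
For anyone following along who hasn't used -t: a minimal sketch of what truncation means when SET1 is longer than SET2. This is not the code from the patch or from toybox's tr.c. The names build_map() and translate() are hypothetical, it only handles literal single-byte sets (no ranges, classes, or utf-8), and the non-truncating branch assumes the pad-with-last-character behavior GNU tr uses by default.

```c
#include <assert.h>
#include <string.h>

// Hypothetical helper, not toybox's implementation: fill a byte-to-byte map
// from set1 to set2. With truncate set, set1 is cut to the length of set2, so
// the extra set1 characters pass through unchanged. Without it, set2 is
// padded with its last character (assumed GNU-style default).
static void build_map(unsigned char map[256], const char *set1,
  const char *set2, int truncate)
{
  size_t len1, len2, i;

  assert(set1 && set2);
  len1 = strlen(set1);
  len2 = strlen(set2);

  // Start with the identity map: every byte translates to itself.
  for (i = 0; i < 256; i++) map[i] = (unsigned char)i;
  if (!len2) return;

  for (i = 0; i < len1; i++) {
    // -t: stop once set2 runs out instead of padding it.
    if (truncate && i >= len2) break;
    map[(unsigned char)set1[i]] = (unsigned char)set2[i < len2 ? i : len2-1];
  }
}

// Apply the map to a NUL terminated string in place.
static void translate(char *s, const unsigned char map[256])
{
  for (; *s; s++) *s = (char)map[(unsigned char)*s];
}
```

With set1 "abc" and set2 "xy", the input "abcdef" comes out as "xyydef" without truncation (c gets mapped to the padded y) and as "xycdef" with it (c is left alone).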
Rob