[Toybox] [PATCH] ls: Ensure file names are separated by 2 spaces

Andrew Ilijic ilijic.andrew at gmail.com
Fri Oct 18 10:39:49 PDT 2019


Hi Rob,

After the third reading of your email, I see that we were saying the same thing.
>
> What _is_ a good test is "-C is utf8 fontmetrics, not a byte count". So if you do...
>
I will look into fontmetrics and a modified `wc` to test the combining
characters.

~Andrew

On Fri, Oct 18, 2019 at 12:02 PM Andrew Ilijic <ilijic.andrew at gmail.com> wrote:
>
> Thank you Rob,
>
> I will add one or more tests to make sure that the two spaces are
> present. For the first test or two, I plan to use ASCII characters.
> The rationale being that if someone decides to compile the program on
> an old system or an embedded system with a minimal operating system
> the tests will work, but we include <locale.h> so that argument is
> probably moot.
>
> My first thought is to have either the program or the test detect its
> context. Then adjust or reinterpret the output depending on whether or
> not we expect the combining characters to eat the space. I looked into
> how other systems test this sort of thing. They take the input, output
> SVG, and then check that it matches the expected SVG.  Do any chance
> terminal emulators have a canonical representation for glyphs that we
> can examine?
>
> For the tests involving the combining characters, it seems like we are
> testing the terminal emulator UTF-8 compatibility more than the `ls`
> program. Rob, can you give an example of when a filename would have
> capturing characters at the end of it? The only thing I can think of
> is a language where, if two words are next to each other, they should
> modify each other. That seems to violate my idea of what a filename
> is.  Don't get me wrong, reading up on this issue and those that
> surround it is fascinating, and it is something that I will keep at
> the top of my mind.
>
> Rob, thank you for all of the work items you suggested. I am
> interested in working on all of them. I want to get my first patch
> accepted and merged and go from there. I don't want to commit to a
> bunch of things and then be unable to deliver.
>
> ~Andrew
>
> On Thu, Oct 17, 2019 at 5:22 PM Rob Landley <rob at landley.net> wrote:
> >
> > I have various torture test utf8 in test/files/utf8. To see a trailing combining
> > character eat a following space, you can do:
> >
> > $ head -c 5 tests/files/utf8/test3.txt; echo ' '
> >
> > Or
> >
> > $ head -c 5 tests/files/utf8/test1.txt; echo 'X'
> >
> > The problem is, it's easy to see visually, but I'm not quite sure how an
> > automated test is expected to care? (Other than going "it produced the right
> > output"...
> >
> > The OTHER fun thing is that -C and -x care about terminal size, or default to 80
> > columns when they can't determine it. In theory there's an ls -w option that
> > lets you set the width, but I never needed/implemented it. (Andrew: you're
> > welcome to if you like. I could walk you through it or you could read
> > http://landley.net/toybox/code.html#lib_args and try to work it out yourself.)
> >
> > So if you did want to test this, you'd want to "touch" a test file with a funky
> > utf8 name, make another test file with a long enough name to force minimal
> > spacing, and then... do what? You'd just be proving they did or didn't fit on
> > the same line. (Basically reproducing known/expected output. If they had one
> > space between, these two would fit, with a minimum of two spaces, they'll wrap.)
> >
> > What _is_ a good test is "-C is utf8 fontmetrics, not a byte count". So if you do:
> >
> > $ mkdir sub
> > $ cd sub
> > $ touch $(cat ../tests/files/utf8/test3.txt)
> > $ toybox ls
> >
> > The result fits on one line, when the file was 198 bytes long. (Note: if you
> > don't put quotes around the "$()" it'll split it into 5 words because whitespace
> > in the sentence.
> >
> > That said:
> > $ wc -m ../tests/files/utf8/test3.txt
> > 110 ../tests/files/utf8/test3.txt
> >
> > And I _visually_ count that as 22 characters. (That's the debian wc, not toybox
> > saying that. Toybox matches debian here, but it seems conceptually wrong. It's
> > like we want another option measuring combined characters. Maybe -M or something?)
> >
> > But testing how your terminal displays the result for "ls bridged the space or
> > they're still visually distinct"...? I have no idea how to automate short of
> > doing a screen capture in a known font and comparing the bitmap, which isn't
> > remotely realistic. :)
> >
> > (There are, sadly, a lot of tests I know how to _perform_ but not _automate_.
> > And the pending mkroot stuff merely _reduces_ that gap, not eliminate it.)
> >
> > Rob
> >
> > On 10/17/19 10:44 AM, enh via Toybox wrote:
> > > new tests? (including the difficult combining-character case?)
> > >
> > > On Thu, Oct 17, 2019 at 8:35 AM Andrew Ilijic <ilijic.andrew at gmail.com> wrote:
> > >>
> > >> From d4c03b672d8b18f7ce021e4fcb02e8bb86f38f5f Mon Sep 17 00:00:00 2001
> > >> From: Andrew Ilijic <ilijic.andrew at gmail.com>
> > >> Date: Thu, 17 Oct 2019 11:03:19 -0400
> > >> Subject: [PATCH] ls: Ensure file names are separated by 2 spaces
> > >>
> > >> We need two spaces between filenames because that is the convention
> > >> followed by other implementations. More importantly, if we do not have
> > >> two spaces, certain Unicode file names cause filenames to run together.
> > >> In Unicode, combining characters come before the character they modify.
> > >> If a filename ends in a combining character, the combining character
> > >> attaches to the space that follows it, causing the space not to be
> > >> visible. Having a two-space gap stops the above issue from happening.
> > >>
> > >> For context and a bit more information, see mailing list link below.
> > >> https://www.mail-archive.com/toybox@lists.landley.net/msg05986.html
> > >> ---
> > >>  toys/posix/ls.c | 3 ++-
> > >>  1 file changed, 2 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/toys/posix/ls.c b/toys/posix/ls.c
> > >> index 0956902f..d4c0211a 100644
> > >> --- a/toys/posix/ls.c
> > >> +++ b/toys/posix/ls.c
> > >> @@ -384,7 +384,8 @@ static void listfiles(int dirfd, struct dirtree *indir)
> > >>        memset(colsizes, 0, columns*sizeof(unsigned));
> > >>        for (ul=0; ul<dtlen; ul++) {
> > >>          entrylen(sort[next_column(ul, dtlen, columns, &c)], len);
> > >> -        *len += totpad+1;
> > >> +        // Add `2` to `totpad` to ensure two spaces between filenames
> > >> +        *len += totpad+2;
> > >>          if (c == columns) break;
> > >>          // Expand this column if necessary, break if that puts us over budget
> > >>          if (*len > colsizes[c]) {
> > >> --
> > >> 2.11.0
> > >> _______________________________________________
> > >> Toybox mailing list
> > >> Toybox at lists.landley.net
> > >> http://lists.landley.net/listinfo.cgi/toybox-landley.net
> > > _______________________________________________
> > > Toybox mailing list
> > > Toybox at lists.landley.net
> > > http://lists.landley.net/listinfo.cgi/toybox-landley.net
> > >



More information about the Toybox mailing list