[Toybox] fmt tests.

enh enh at google.com
Wed Jun 27 16:25:06 PDT 2018


On Fri, Jun 22, 2018 at 4:06 PM Rob Landley <rob at landley.net> wrote:
>
> On 06/22/2018 03:24 PM, enh wrote:
> >    ‘fmt’ prefers breaking lines at the end of a sentence, and tries to
> > avoid line breaks after the first word of a sentence or before the last
> > word of a sentence.  A “sentence break” is defined as either the end of
> > a paragraph or a word ending in any of ‘.?!’, followed by two spaces or
> > end of line, ignoring any intervening parentheses or quotes.  Like TeX,
> > ‘fmt’ reads entire “paragraphs” before choosing line breaks; the
> > algorithm is a variant of that given by Donald E. Knuth and Michael F.
> > Plass in “Breaking Paragraphs Into Lines”, ‘Software—Practice &
> > Experience’ 11, 11 (November 1981), 1119–1184.
>
> So the change of indentation is being interpreted as a paragraph break and
> causing it to behave differently. For a definition of differently that seems
> more or less random here, but ok.
>
> *shrug* I could implement some sort of "last word ended with ispunct() and the
> next word is short and would otherwise be the last word on the line" detection,
> but... not well defined and doesn't seem worth it?
>
> The two spaces after period thing went away in the 90's because html squashed
> all whitespace into a single space so you'd have to &nbsp; if you wanted an
> extra space after a period, and the tiny minority that bothered circa 1993 got
> lost in the noise. After a few years of everybody seeing text with one space
> after periods, anything else looked silly. At this point it's been stone dead
> for well over a decade.
>
> And when I posted about it on twitter recently somebody pointed out that one
> space after period was a macintosh peculiarity (as mentioned in the book "The
> Mac is not a Typewriter"), and since Tim Berners-Lee implemented the first web
> browser on a NeXT box he might have picked it up from there:
>
> https://twitter.com/steveax/status/1007482609838931969

i think it might have been an American thing. i first learned this was
a thing from reading Knuth. i don't remember ever having
double-spaced. who could afford that on a 40-column display? but then
i can't be trusted to use capital letters most of the time.

the original fortran source for adventure doesn't double-space :-)
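
for reference, the "sentence break" rule in the quoted fmt documentation can be sketched in a few lines of Python. this is only an illustration of the rule as worded there, not code from GNU fmt or toybox: the function name, the set of closing characters it skips, and the way the following text is passed in are all assumptions, and the "end of a paragraph" case is left out.

```python
# Hypothetical helper, not taken from any fmt implementation.
# Characters treated as "intervening parentheses or quotes" (an assumption).
CLOSERS = ")]'\"\u2019\u201d"

def ends_sentence(word, following):
    """Return True if `word` ends a sentence per the quoted rule.

    word      -- the token itself, e.g. 'done.)'
    following -- the raw text after the token on the same line
                 ('' means the token is at end of line)
    """
    core = word.rstrip(CLOSERS)          # ignore trailing ) ] ' " etc.
    if not core or core[-1] not in ".?!":
        return False
    # Must be followed by two spaces or by end of line.
    return following == "" or following.startswith("  ")
```

with a single space after the period this never fires mid-line, which is why one-space text gets no special treatment under that rule.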

> But then again treating space, runs of space, and newline all the same
> (resulting in a single space with line breaks as appropriate) is also really
> simple programming, so maybe it was just that. :)
>
> >> If you remove the space after the newline they match, but testing fmt without
> >> indentation is missing like half the logic? I made the existing tests pass, but
> >> I want to add tests to actually test what the new one is doing, like measuring
> >> and preserving tab/space mixes in indents. But fmt turns into weird corner case
> >> city. I ran the README and main.c through it when developing it, but that's not
> >> a stable test I can put in the test suite...
> >
> > yeah, i hit this too, and most of my testing was done manually with
> > toybox's README. (sorry, i think the gap between me starting on fmt
> > and actually sending it in was large enough that i'd forgotten these
> > details.)
>
> *shrug* I'm comfortable enough to promote it, just trying to figure out what the
> test cases should be. I wasn't previously a regular user of fmt and dunno what
> success looks like here. :)

as long as i can !!fmt when git commit drops me into vi...

> I should add a test to make sure tabs in front get retained as such though. The
> code should be doing it, I just need a test... Um, the other one _is_ doing
> that, right?
>
> $ echo -e '\thello\n\tworld' | fmt | hexdump -C
> 00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
> $ echo -e '\thello\n\tworld' | ./fmt | hexdump -C
> 00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
>
> Yup, consistency!
>
> $ echo -e '\thello\n        world' | ./fmt | hexdump -C
> 00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
> $ echo -e '\thello\n        world' | fmt | hexdump -C
> 00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
>
> Bwahahaha!
>
> Ok, now I'm curious:
>
> $ echo -e '\thello\n        world and then more' | fmt -w 20 | hexdump -C
> 00000000  09 68 65 6c 6c 6f 0a 09  77 6f 72 6c 64 20 61 6e  |.hello..world an|
> 00000010  64 0a 09 74 68 65 6e 20  6d 6f 72 65 0a           |d..then more.|
> $ echo -e '\thello\n        world and then more' | ./fmt -w 20 | hexdump -C
> 00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a 20 20 20  |.hello world.   |
> 00000010  20 20 20 20 20 61 6e 64  20 74 68 65 6e 0a 20 20  |     and then.  |
> 00000020  20 20 20 20 20 20 6d 6f  72 65 0a                 |      more.|
>
> Yeah, they're being less lazy than I am. (I indent with whatever the current
> line I'm splitting was used to indent with, provided the whitespace width count
> is consistent so it's the same paragraph. They're recording the string the
> paragraph _started_ with. I don't think I care enough to fix it, it should
> _look_ consistent and the inconsistency was in the input...)
>
> So what happens when... Nope, that's _not_ what they're doing:
>
> $ echo -e '                hello\n\t\tworld and then more' | ./fmt -w 20 | hexdump -C
> 00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
> 00000010  68 65 6c 6c 6f 0a 09 09  77 6f 72 6c 64 0a 09 09  |hello...world...|
> 00000020  61 6e 64 0a 09 09 74 68  65 6e 0a 09 09 6d 6f 72  |and...then...mor|
> 00000030  65 0a                                             |e.|
>
> ???
>
> $ echo -e '                hello and then we wrap because\n\t\tworld and then more' | ./fmt -w 25 | hexdump -C
> 00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
> 00000010  68 65 6c 6c 6f 0a 20 20  20 20 20 20 20 20 20 20  |hello.          |
> 00000020  20 20 20 20 20 20 61 6e  64 20 74 68 65 6e 0a 20  |      and then. |
> 00000030  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 77  |               w|
> 00000040  65 20 77 72 61 70 0a 20  20 20 20 20 20 20 20 20  |e wrap.         |
> 00000050  20 20 20 20 20 20 20 62  65 63 61 75 73 65 0a 09  |       because..|
> 00000060  09 77 6f 72 6c 64 0a 09  09 61 6e 64 20 74 68 65  |.world...and the|
> 00000070  6e 0a 09 09 6d 6f 72 65  0a                       |n...more.|
>
> Nope, not going down this rathole. In the absence of a specification, I think
> I'll stick with what I've got.

the trivial algorithm has been good enough for me since 1992.
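
a minimal sketch of what i mean by "trivial", in Python rather than C to keep it short. it is not the toybox code: the function names, the 8-column tab stop, and the default width of 75 are assumptions, it treats the whole input as one paragraph, and it reuses the first line's indent string for every output line (which is not what toybox does, per the hexdumps above).

```python
def indent_width(prefix, tabstop=8):
    """Display width of a leading-whitespace string, expanding tabs."""
    width = 0
    for ch in prefix:
        if ch == "\t":
            width += tabstop - width % tabstop  # advance to next tab stop
        else:
            width += 1
    return width

def greedy_fmt(text, width=75):
    """Refill one paragraph greedily, keeping the first line's indent."""
    first = text.split("\n", 1)[0]
    indent = first[: len(first) - len(first.lstrip(" \t"))]
    start = indent_width(indent)
    out, line, col = [], [], start
    for word in text.split():
        # One extra column for the space if the line already has a word.
        need = len(word) + (1 if line else 0)
        if line and col + need > width:
            out.append(indent + " ".join(line))  # flush the full line
            line, col = [], start
            need = len(word)
        line.append(word)
        col += need
    if line:
        out.append(indent + " ".join(line))
    return "\n".join(out) + "\n" if out else ""
```

for example, `greedy_fmt("\thello\n        world and then more", 20)` breaks as "hello world" / "and then" / "more", the same line breaks as the ./fmt run above, but with a tab in front of every line because this sketch only remembers the first indent.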

> Planning to cut a release this weekend...
>
> Rob


