[Toybox] fmt tests.

Rob Landley rob at landley.net
Fri Jun 22 16:06:04 PDT 2018


On 06/22/2018 03:24 PM, enh wrote:
>    ‘fmt’ prefers breaking lines at the end of a sentence, and tries to
> avoid line breaks after the first word of a sentence or before the last
> word of a sentence.  A “sentence break” is defined as either the end of
> a paragraph or a word ending in any of ‘.?!’, followed by two spaces or
> end of line, ignoring any intervening parentheses or quotes.  Like TeX,
> ‘fmt’ reads entire “paragraphs” before choosing line breaks; the
> algorithm is a variant of that given by Donald E. Knuth and Michael F.
> Plass in “Breaking Paragraphs Into Lines”, ‘Software—Practice &
> Experience’ 11, 11 (November 1981), 1119–1184.

So the change of indentation is being interpreted as a paragraph break and
causing it to behave differently. For a definition of "differently" that seems
more or less random here, but ok.

*shrug* I could implement some sort of "last word ended with ispunct() and the
next word is short and would otherwise be the last word on the line" detection,
but... not well defined and doesn't seem worth it?
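
For what it's worth, the check described above might look roughly like this in
shell. It's only a sketch: break_before_short_word and SHORT_MAX are made-up
names, the cutoff of 3 is an arbitrary assumption since "short" isn't defined
anywhere, and [[:punct:]] stands in for ispunct(). Whether the next word would
otherwise end the line is left to the caller.

```shell
# Hypothetical sketch, not toybox code.
SHORT_MAX=3   # assumed cutoff for a "short" word; not from any spec

# Succeeds (returns 0) when $1 ends in punctuation and $2 is short.
# The caller still has to check that $2 would otherwise be the last
# word on the current output line.
break_before_short_word() {
  case $1 in
    *[[:punct:]]) ;;        # shell analog of ispunct() on the last char
    *) return 1 ;;
  esac
  [ "${#2}" -le "$SHORT_MAX" ]
}

break_before_short_word "sentence." "A" && echo "break before A"
```

Picking the cutoff is exactly the "not well defined" part, which is why it's
a variable here rather than anything principled.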

The two spaces after period thing went away in the 90's because HTML squashed
all whitespace into a single space so you'd have to add &nbsp; if you wanted an
extra space after a period, and the tiny minority that bothered circa 1993 got
lost in the noise. After a few years of everybody seeing text with one space
after periods, anything else looked silly. At this point it's been stone dead
for well over a decade.

And when I posted about it on Twitter recently somebody pointed out that one
space after period was a Macintosh peculiarity (as mentioned in the book "The
Mac is not a Typewriter"), and since Tim Berners-Lee implemented the first web
browser on a NeXT box he might have picked it up from there:

https://twitter.com/steveax/status/1007482609838931969

But then again treating space, runs of space, and newline all the same
(resulting in a single space with line breaks as appropriate) is also really
simple programming, so maybe it was just that. :)
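
To show how little code "treat space, runs of space, and newline all the same"
takes, here is a small shell sketch. collapse_ws is a name invented for this
example, and it only illustrates the idea; it makes no claim about how any
browser actually did it.

```shell
# Illustration only: collapse spaces, tabs and newlines into single spaces.
collapse_ws() {
  # awk's default field splitting already ignores runs of blanks, so
  # printing every field with one separator does the whole job.
  awk '{ for (i = 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = " " } }
       END { printf "\n" }'
}

printf 'one.  Two\n   three\tfour\n' | collapse_ws
# prints: one. Two three four
```

Note that the two spaces after "one." vanish along with everything else, which
is the effect described above.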

>> If you remove the space after the newline they match, but testing fmt without
>> indentation is missing like half the logic? I made the existing tests pass, but
>> I want to add tests to actually test what the new one is doing, like measuring
>> and preserving tab/space mixes in indents. But fmt turns into weird corner case
>> city. I ran the README and main.c through it when developing it, but that's not
>> a stable test I can put in the test suite...
> 
> yeah, i hit this too, and most of my testing was done manually with
> toybox's README. (sorry, i think the gap between me starting on fmt
> and actually sending it in was large enough that i'd forgotten these
> details.)

*shrug* I'm comfortable enough to promote it, just trying to figure out what the
test cases should be. I wasn't previously a regular user of fmt and dunno what
success looks like here. :)

I should add a test to make sure tabs in front get retained as such though. The
code should be doing it, I just need a test... Um, the other one _is_ doing
that, right?

$ echo -e '\thello\n\tworld' | fmt | hexdump -C
00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
$ echo -e '\thello\n\tworld' | ./fmt | hexdump -C
00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|

Yup, consistency!
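
A standalone version of that check, without eyeballing hexdump output, could be
sketched like this. starts_with_tab is a hypothetical helper, not part of the
toybox test suite, and the printf in the last line is a stand-in for fmt output
so the snippet runs on its own.

```shell
# Hypothetical check: succeeds if the first byte on stdin is a tab (0x09).
starts_with_tab() {
  [ "$(head -c 1 | od -An -tx1 | tr -d ' \n')" = "09" ]
}

# Stand-in for fmt output so the snippet runs without fmt installed:
printf '\thello world\n' | starts_with_tab && echo "tab kept"
```

Against a real binary the pipeline would be something like
printf '\thello\n\tworld\n' | ./fmt | starts_with_tab.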

$ echo -e '\thello\n        world' | ./fmt | hexdump -C
00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|
$ echo -e '\thello\n        world' | fmt | hexdump -C
00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a           |.hello world.|

Bwahahaha!

Ok, now I'm curious:

$ echo -e '\thello\n        world and then more' | fmt -w 20 | hexdump -C
00000000  09 68 65 6c 6c 6f 0a 09  77 6f 72 6c 64 20 61 6e  |.hello..world an|
00000010  64 0a 09 74 68 65 6e 20  6d 6f 72 65 0a           |d..then more.|
$ echo -e '\thello\n        world and then more' | ./fmt -w 20 | hexdump -C
00000000  09 68 65 6c 6c 6f 20 77  6f 72 6c 64 0a 20 20 20  |.hello world.   |
00000010  20 20 20 20 20 61 6e 64  20 74 68 65 6e 0a 20 20  |     and then.  |
00000020  20 20 20 20 20 20 6d 6f  72 65 0a                 |      more.|

Yeah, they're being less lazy than I am. (I indent with whatever the line I'm
currently splitting was indented with, provided the whitespace width count is
consistent so it's the same paragraph. They're recording the string the
paragraph _started_ with. I don't think I care enough to fix it: it should
_look_ consistent and the inconsistency was in the input...)
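
The "whitespace width count" comparison can be sketched as below. This is not
the toybox code: indent_width is a made-up name, and tab stops every 8 columns
are an assumption (consistent with the earlier runs, where one tab and eight
spaces were treated as the same indent).

```shell
# Hypothetical sketch: display width of a line's leading whitespace,
# assuming tab stops every 8 columns.
TAB=$(printf '\t')

indent_width() {
  line=$1 col=0
  while [ -n "$line" ]; do
    case $line in
      ' '*)     col=$((col + 1)) ;;               # space: one column
      "$TAB"*)  col=$(( (col / 8 + 1) * 8 )) ;;   # tab: jump to next stop
      *)        break ;;                          # first non-blank: done
    esac
    line=${line#?}                                # drop the char just read
  done
  echo "$col"
}

indent_width "$(printf '\thello')"   # prints 8
indent_width '        world'         # prints 8
```

Two lines whose widths match count as the same paragraph under this scheme even
when the bytes differ, which is how a tab/space mix ends up in the output.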

So what happens when... Nope, that's _not_ what they're doing:

$ echo -e '                hello\n\t\tworld and then more' | ./fmt -w 20 | hexdump -C
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000010  68 65 6c 6c 6f 0a 09 09  77 6f 72 6c 64 0a 09 09  |hello...world...|
00000020  61 6e 64 0a 09 09 74 68  65 6e 0a 09 09 6d 6f 72  |and...then...mor|
00000030  65 0a                                             |e.|

???

$ echo -e '                hello and then we wrap because\n\t\tworld and then more' | ./fmt -w 25 | hexdump -C
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000010  68 65 6c 6c 6f 0a 20 20  20 20 20 20 20 20 20 20  |hello.          |
00000020  20 20 20 20 20 20 61 6e  64 20 74 68 65 6e 0a 20  |      and then. |
00000030  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 77  |               w|
00000040  65 20 77 72 61 70 0a 20  20 20 20 20 20 20 20 20  |e wrap.         |
00000050  20 20 20 20 20 20 20 62  65 63 61 75 73 65 0a 09  |       because..|
00000060  09 77 6f 72 6c 64 0a 09  09 61 6e 64 20 74 68 65  |.world...and the|
00000070  6e 0a 09 09 6d 6f 72 65  0a                       |n...more.|

Nope, not going down this rathole. In the absence of a specification, I think
I'll stick with what I've got.

Planning to cut a release this weekend...

Rob


