[Toybox] fmt tests.
enh
enh at google.com
Wed Jun 27 16:25:06 PDT 2018
On Fri, Jun 22, 2018 at 4:06 PM Rob Landley <rob at landley.net> wrote:
>
> On 06/22/2018 03:24 PM, enh wrote:
> > ‘fmt’ prefers breaking lines at the end of a sentence, and tries to
> > avoid line breaks after the first word of a sentence or before the last
> > word of a sentence. A “sentence break” is defined as either the end of
> > a paragraph or a word ending in any of ‘.?!’, followed by two spaces or
> > end of line, ignoring any intervening parentheses or quotes. Like TeX,
> > ‘fmt’ reads entire “paragraphs” before choosing line breaks; the
> > algorithm is a variant of that given by Donald E. Knuth and Michael F.
> > Plass in “Breaking Paragraphs Into Lines”, ‘Software—Practice &
> > Experience’ 11, 11 (November 1981), 1119–1184.
>
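For reference, the sentence-break rule quoted above is concrete enough to sketch in C. This is only an illustration of the wording in that documentation, not code from GNU fmt or toybox: the name ends_sentence and the exact set of closing characters it skips are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from any fmt source.
 * word points at the first character of a word inside a line and len is
 * the number of non-blank characters in that word (len > 0). */
bool ends_sentence(const char *word, size_t len)
{
    const char *after = word + len;   /* first byte following the word */

    /* Ignore closing quotes and brackets (assumed set) after the punctuation. */
    while (len > 0 && strchr("\"')]", word[len - 1]))
        len--;

    /* What is left has to end in one of '.', '?' or '!'. */
    if (len == 0 || !strchr(".?!", word[len - 1]))
        return false;

    /* End of line (or end of input) qualifies, and so do two spaces. */
    if (*after == '\0' || *after == '\n')
        return true;
    return after[0] == ' ' && after[1] == ' ';
}
```

With this reading, "done." followed by a single space is not a sentence break, which is why single-spaced text rarely triggers the rule.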
> So the change of indentation is being interpreted as a paragraph break and
> causing it to behave differently. For a definition of differently that seems
> more or less random here, but ok.
>
> *shrug* I could implement some sort of "last word ended with ispunct() and the
> next word is short and would otherwise be the last word on the line" detection,
> but... not well defined and doesn't seem worth it?
>
> The two spaces after period thing went away in the 90's because html squashed
> all whitespace into a single space so you'd have to use &nbsp; if you wanted an
> extra space after a period, and the tiny minority that bothered circa 1993 got
> lost in the noise. After a few years of everybody seeing text with one space
> after periods, anything else looked silly. At this point it's been stone dead
> for well over a decade.
>
> And when I posted about it on twitter recently somebody pointed out that one
> space after period was a Macintosh peculiarity (as mentioned in the book "The
> Mac is not a Typewriter"), and since Tim Berners-Lee implemented the first web
> browser on a NeXT box he might have picked it up from there:
>
> https://twitter.com/steveax/status/1007482609838931969

i think it might have been an American thing. i first learned this was
a thing from reading Knuth. i don't remember ever having
double-spaced. who could afford that on a 40-column display? but then
i can't be trusted to use capital letters most of the time.

the original Fortran source for Adventure doesn't double-space :-)

> But then again treating space, runs of space, and newline all the same
> (resulting in a single space with line breaks as appropriate) is also really
> simple programming, so maybe it was just that. :)
>
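The "treat space, runs of space, and newline all the same" behaviour really is only a few lines. A minimal sketch, assuming the caller supplies a big enough output buffer; squeeze_space is a made-up name, not something from toybox or any browser:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: collapse every run of whitespace (spaces, tabs,
 * newlines) in src into a single space, dropping leading and trailing runs.
 * dst must hold at least strlen(src) + 1 bytes. Returns the length written. */
size_t squeeze_space(char *dst, const char *src)
{
    size_t n = 0;
    int pending = 0;   /* saw whitespace since the last word character */

    for (; *src; src++) {
        if (isspace((unsigned char)*src)) {
            pending = 1;
            continue;
        }
        if (pending && n > 0)
            dst[n++] = ' ';   /* one space stands in for the whole run */
        pending = 0;
        dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

Once the input has been through this, any two-spaces-after-period information is gone, so nothing downstream can act on it.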
> >> If you remove the space after the newline they match, but testing fmt without
> >> indentation is missing like half the logic? I made the existing tests pass, but
> >> I want to add tests to actually test what the new one is doing, like measuring
> >> and preserving tab/space mixes in indents. But fmt turns into weird corner case
> >> city. I ran the README and main.c through it when developing it, but that's not
> >> a stable test I can put in the test suite...
> >
> > yeah, i hit this too, and most of my testing was done manually with
> > toybox's README. (sorry, i think the gap between me starting on fmt
> > and actually sending it in was large enough that i'd forgotten these
> > details.)
>
> *shrug* I'm comfortable enough to promote it, just trying to figure out what the
> test cases should be. I wasn't previously a regular user of fmt and dunno what
> success looks like here. :)

as long as i can !!fmt when git commit drops me into vi...

> I should add a test to make sure tabs in front get retained as such though. The
> code should be doing it, I just need a test... Um, the other one _is_ doing
> that, right?
>
> $ echo -e '\thello\n\tworld' | fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
> $ echo -e '\thello\n\tworld' | ./fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
>
> Yup, consistency!
>
> $ echo -e '\thello\n world' | ./fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
> $ echo -e '\thello\n world' | fmt | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a |.hello world.|
>
> Bwahahaha!
>
> Ok, now I'm curious:
>
> $ echo -e '\thello\n world and then more' | fmt -w 20 | hexdump -C
> 00000000 09 68 65 6c 6c 6f 0a 09 77 6f 72 6c 64 20 61 6e |.hello..world an|
> 00000010 64 0a 09 74 68 65 6e 20 6d 6f 72 65 0a |d..then more.|
> $ echo -e '\thello\n world and then more' | ./fmt -w 20 | hexdump -C
> 00000000 09 68 65 6c 6c 6f 20 77 6f 72 6c 64 0a 20 20 20 |.hello world. |
> 00000010 20 20 20 20 20 61 6e 64 20 74 68 65 6e 0a 20 20 | and then. |
> 00000020 20 20 20 20 20 20 6d 6f 72 65 0a | more.|
>
> Yeah, they're being less lazy than I am. (I indent with whatever the current
> line I'm splitting was used to indent with, provided the whitespace width count
> is consistent so it's the same paragraph. They're recording the string the
> paragraph _started_ with. I don't think I care enough to fix it, it should
> _look_ consistent and the inconsistency was in the input...)
>
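The "whitespace width count" mentioned above can be sketched as follows. This assumes 8-column tab stops and is not the toybox code; indent_width and its end parameter are names made up for the illustration:

```c
#include <stddef.h>

/* Hypothetical helper: visual width of the leading whitespace of line,
 * assuming tab stops every 8 columns. If end is non-NULL it receives the
 * offset of the first byte that is neither a space nor a tab. */
size_t indent_width(const char *line, size_t *end)
{
    size_t col = 0, i = 0;

    for (;; i++) {
        if (line[i] == ' ')
            col++;
        else if (line[i] == '\t')
            col += 8 - col % 8;   /* advance to the next tab stop */
        else
            break;
    }
    if (end)
        *end = i;
    return col;
}
```

A single tab and eight spaces both measure 8 here, and two tabs and sixteen spaces both measure 16, so lines that differ only in how the indent is spelled still compare as the same width even though the bytes emitted for each line can differ.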
> So what happens when... Nope, that's _not_ what they're doing:
>
> $ echo -e ' hello\n\t\tworld and then more' | ./fmt -w 20 | hexdump -C
> 00000000 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
> 00000010 68 65 6c 6c 6f 0a 09 09 77 6f 72 6c 64 0a 09 09 |hello...world...|
> 00000020 61 6e 64 0a 09 09 74 68 65 6e 0a 09 09 6d 6f 72 |and...then...mor|
> 00000030 65 0a |e.|
>
> ???
>
> $ echo -e ' hello and then we wrap because\n\t\tworld and then more' | ./fmt -w 25 | hexdump -C
> 00000000 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
> 00000010 68 65 6c 6c 6f 0a 20 20 20 20 20 20 20 20 20 20 |hello. |
> 00000020 20 20 20 20 20 20 61 6e 64 20 74 68 65 6e 0a 20 | and then. |
> 00000030 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 77 | w|
> 00000040 65 20 77 72 61 70 0a 20 20 20 20 20 20 20 20 20 |e wrap. |
> 00000050 20 20 20 20 20 20 20 62 65 63 61 75 73 65 0a 09 | because..|
> 00000060 09 77 6f 72 6c 64 0a 09 09 61 6e 64 20 74 68 65 |.world...and the|
> 00000070 6e 0a 09 09 6d 6f 72 65 0a |n...more.|
>
> Nope, not going down this rathole. In the absence of a specification, I think
> I'll stick with what I've got.

the trivial algorithm has been good enough for me since 1992.

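To spell out what "trivial" is taken to mean here (an assumption: greedy first-fit, as opposed to the Knuth-Plass style paragraph optimisation in the documentation quoted earlier), a rough sketch in C. greedy_fill is a hypothetical name, and indent handling is left out entirely:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: greedy first-fit fill. Each word goes on the current
 * line if it fits within width columns, otherwise it starts a new line.
 * A word longer than width gets a line to itself. out must hold at least
 * strlen(text) + 2 bytes. */
void greedy_fill(char *out, const char *text, size_t width)
{
    static const char blanks[] = " \t\n";
    size_t col = 0;   /* columns used on the current output line */

    while (*text) {
        size_t len;

        text += strspn(text, blanks);    /* skip the run between words */
        len = strcspn(text, blanks);     /* length of the next word */
        if (!len)
            break;
        if (col && col + 1 + len > width) {
            *out++ = '\n';               /* word does not fit: new line */
            col = 0;
        }
        if (col) {
            *out++ = ' ';
            col++;
        }
        memcpy(out, text, len);
        out += len;
        col += len;
        text += len;
    }
    if (col)
        *out++ = '\n';
    *out = '\0';
}
```

It never looks ahead, so it cannot avoid stranding a short word at the end of a line, but the output only depends on the words and the width.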
> Planning to cut a release this weekend...
>
> Rob