[Toybox] Generic editor. Was: fold implementation

Rob Landley rob at landley.net
Sat Apr 12 19:00:37 PDT 2014


On 04/08/14 23:46, Robert Thompson wrote:
> In the context of terminal control, "8-bit encoding" is a bit
> misleading, especially if you're googling for information. The pre-UCS
> ISO/IEC 2022 character-set extension architecture is actually related,
> but most of the information you find that way will be irrelevant. In the
> context of terminal control, the relevant search term seems to be (8-bit
> terminal control characters) or (8-bit control sequences). The 8-bit
> control sequences are actually related to ISO 2022 (and the term 'C1
> codes' derives from this) ... but the documents talking about
> character-set selection rarely mention anything about terminal control.

I noticed that.

I'm strongly tempted to just treat both escape sequences as "we saw an
escape sequence" and then the things after that translate the same
whether they were after the two char or the one char version.

Except that the 8 bit one doesn't seem compatible with UTF-8, and I want
the code to deal with UTF-8 properly.
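
A rough sketch of the "treat both the same" idea, with the UTF-8 problem made explicit. The names here (scan_csi, the enum, the allow_c1 flag) are made up for illustration and are not toybox code. The only facts it leans on are that ESC is 0x1b, the two-char introducer is ESC '[', and the one-char version is 0x9b:

```c
#include <stddef.h>

/* Hypothetical token names, not from toybox. */
enum intro { INTRO_NONE, INTRO_CSI, INTRO_INCOMPLETE };

/*
 * Look for a CSI introducer at the start of buf.
 * The 7-bit form (ESC '[') is always accepted. The 8-bit form (0x9b) is
 * only accepted when allow_c1 is nonzero, because a bare 0x9b is also a
 * UTF-8 continuation byte and can't be trusted in a UTF-8 stream.
 * On INTRO_CSI, *used is the number of bytes the introducer took, so the
 * caller can parse everything after it the same way for both forms.
 */
enum intro scan_csi(const unsigned char *buf, size_t len, int allow_c1,
                    size_t *used)
{
  *used = 0;
  if (!len) return INTRO_NONE;
  if (buf[0] == 0x1b) {
    if (len < 2) return INTRO_INCOMPLETE;  /* need one more byte to decide */
    if (buf[1] == '[') {
      *used = 2;
      return INTRO_CSI;
    }
    return INTRO_NONE;
  }
  if (allow_c1 && buf[0] == 0x9b) {
    *used = 1;
    return INTRO_CSI;
  }
  return INTRO_NONE;
}
```

With allow_c1 set to 0 this only ever recognizes the two-char version, which is probably what a UTF-8 aware editor wants.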

> Incidentally, the ISO 2022 standard is also how character-set overlaying
> was managed in the VT-series terminals. A good example can be found
> at https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets
> (the rest of the article is less relevant to terminals). There were
> escape sequences to load one of several alternate C0 sets and escape
> sequences to load one of several alternate C1 sets. These might change
> the byte-to-character interpretation, as well as change the symbols
> displayed for a given codepoint. Hard terminals only usually had a
> couple of these built in as a build-option, and soft-terminals often
> don't bother supporting them at all, especially if they're trying for
> UTF-8 compatibility.

We're trying for UTF-8 compatibility, and I don't think we'd want the
complexity of implementing state shifting anyway.

> Certain codes in the C0 range were reserved for control codes (tab,
> backspace, carriage return, newline, etc). With only one or two
> exceptions, the C1 codes were not reserved by standard in the same way.
> However, there were several codes reserved in the C1 range by common
> usage for terminal-control purposes.
> 
> The xterm control sequence documentation specifies the 8-bit C1 codes as

Ooh! "xterm control sequence documentation", a google search phrase:

http://invisible-island.net/xterm/ctlseqs/ctlseqs.html

(Helps to know what you're looking for...)

Also turned up:

http://www.x.org/archive/X11R6.8.1/doc/xterm.1.html

Which is less useful. (Tektronix 4014? Double sized characters? Nope.)

> alternate single-byte codes between 0x84 and 0x9f (within the ISO 2022
> C1 control plane) that are equivalent to certain two-byte codes
> beginning with ESC.

Once again, how do they interact with unicode? :)

There are weird unicode keyboard compositing whatsises that say "put a
diacritical mark over that character you just typed" or "the next
character will be wearing carmina miranda burana's fruit hat" or stuff.
I have no _clue_ about this, but will probably need to learn it.

> Other sources vary slightly, but this source
> (http://rtfm.etla.org/xterm/ctlseq.html) seems to be the closest to a
> common superset definition that I've found, with most terminals being
> more similar to its documented expectations than they are similar to
> each other. 

I think that's the same as the invisible island one google found.

> Xterm, dtterm, various VT-series, and several less-common terminals can
> emit these. In the case of xterm, 8-bit control sequences (or at least
> xterm's emission of such) is controlled by an option which defaults to
> generating 7-bit control sequences rather than C1 codes. It is(was?)
> fairly common for terminal emulators to accept both 7-bit and 8-bit
> alternate encodings, while only emitting one preferred encoding (usually
> the 7-bit one).

Gnome has a terminal. XFCE has a terminal. KDE has Konsole.

The thing is, they're all pieces of software. We have a piece of
software at one end talking to another piece of software at another end,
and mostly they're filtering through curses, which replaced terminfo,
which replaced termcap. The actual hardware terminals that sent and
required specific sequences went away 30 years ago; this entire _layer_
exists because people can't agree on standard escapes and keep trying to
humor obscure pieces of software. I'm leaning pretty strongly towards
"If you can't be bothered to set $TERM to something sane, you get to
keep the pieces", and letting software that can't cope with that just
die off.

In theory, the $TERM environment variable tells you what to look for. By
default on linux it should be set to "linux". The current value on my
system appears to be "xterm".

$ grep -lr '"TERM"' toys
toys/other/login.c
toys/lsb/su.c
toys/pending/getty.c
toys/pending/init.c
toys/pending/telnet.c

Let's see, in login.c:

  if (clear_env) {
    const char * term = getenv("TERM");
    clearenv();
    if (term) setenv("TERM", term, 1);
  }

So a segfault if it's not already set. (Honestly, I should just take the
commands from my "to review" list predating the pending directory and
move them to pending. They need the same going-over...)
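
For comparison, a defensive way to carry TERM across clearenv() might look like the following. This is a sketch, not the login.c code: clear_env_keep_term is a made-up name, and it assumes a libc that provides clearenv() (glibc does, but it is not in POSIX). It copies the value before wiping the environment so nothing depends on what clearenv() does with the old strings:

```c
#define _DEFAULT_SOURCE  /* for clearenv() and strdup() on glibc */
#include <stdlib.h>
#include <string.h>

/*
 * Wipe the environment but keep TERM if it was set.
 * Returns 0 on success, -1 if the copy or the setenv() failed.
 */
int clear_env_keep_term(void)
{
  const char *term = getenv("TERM");
  char *copy = term ? strdup(term) : NULL;
  int rc = 0;

  if (term && !copy) return -1;    /* out of memory, leave env alone */
  clearenv();
  if (copy) {
    rc = setenv("TERM", copy, 1);  /* setenv() makes its own copy */
    free(copy);
  }
  return rc;
}
```

If TERM was never set, the environment simply ends up empty and the function still returns 0.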

toys/lsb/su.c:

That does a get_env() that returns NULL if it's not set, and blindly
does putenv() on that. And the man page does not require putenv to
accept a NULL without segfaulting or corrupting the environment space.
(Ok, that one's my fault since I already did a cleanup pass. In 5 minute
chunks scattered over several days, but still.)

toys/pending/getty.c: sets it from command line, doesn't really care
what it is.

toys/pending/init.c: sets it to "linux" unconditionally... and then
calls getty with the argument "vt100". What the...? (There's a reason
it's still in pending...)

toys/pending/telnet.c: reads it, sets it to "" if blank.

So that's examples of "linux", "xterm", and "vt100" in the wild. And I'm
not too sure about that last one.

Meanwhile, "echo /lib/terminfo/*/* | wc" is finding 39 terminal types
installed by default on my ubuntu box. Um...

echo /lib/terminfo/*/* | sed 's@[^ ]*/@@g'
ansi cons25 cons25-debian cygwin dumb Eterm Eterm-color hurd linux mach
mach-bold mach-color pcansi rxvt rxvt-basic rxvt-m rxvt-unicode screen
screen-256color screen-256color-bce screen-bce screen-s screen-w sun
vt100 vt102 vt220 vt52 wsvt25 wsvt25m xterm xterm-256color xterm-color
xterm-debian xterm-mono xterm-r5 xterm-r6 xterm-vt220 xterm-xfree86

Pretty sure the different xterm variants can go away, cygwin/hurd/mach
are useless, sun is toast... rxvt needs its own type? Really? What's an
Eterm? (Eterm-color is a symlink anyway, as is rxvt-m.)

My rule of thumb here is if a terminal program doesn't work at the
command prompt, vi doesn't need to care either...

> Support for 8-bit C1 codes is mostly incompatible with UTF-8, since it
> is ambiguous in any given environment whether the terminal stream is
> supposed to be interpreted as an 8-bit stream of byte-sequence control
> codes interspersed with text interpreted as UTF-8, or a UTF-8 stream of
> mixed control-codes and text. The first was once considered more logical
> and was more common, but since it is incompatible with using UTF-aware
> stream apis, it has become rare. The second is merely complex and
> inefficient, and not inter-compatible with the first.

Yeah, I suspected as much.
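
To make the two readings concrete: in the raw 8-bit reading a C1 control is a single byte in 0x80..0x9f, while in the UTF-8 reading the same code point (U+0080..U+009F) arrives as 0xc2 followed by a byte in 0x80..0x9f. A small sketch, with function names invented for illustration:

```c
#include <stddef.h>

/* Raw 8-bit reading: any single byte 0x80..0x9f is a C1 control. */
int is_c1_raw(unsigned char b)
{
  return b >= 0x80 && b <= 0x9f;
}

/*
 * UTF-8 reading: U+0080..U+009F are encoded as 0xc2 followed by a
 * continuation byte 0x80..0x9f. Returns the code point, or -1 if buf
 * doesn't start with that two-byte form.
 */
int c1_from_utf8(const unsigned char *buf, size_t len)
{
  if (len < 2 || buf[0] != 0xc2 || !is_c1_raw(buf[1])) return -1;
  /* Standard two-byte decode: 5 payload bits, then 6 payload bits. */
  return ((buf[0] & 0x1f) << 6) | (buf[1] & 0x3f);
}
```

So CSI is the lone byte 0x9b in one world and the pair 0xc2 0x9b in the other, and a lone 0x9b is not valid UTF-8 at all. A reader has to pick one interpretation up front.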

We're not required to work with every insane terminal type out there. We
can _supply_ a terminal emulator that's not crazy. (Busybox has minicom,
toybox has a sad little scripts/minicom.sh wrapper script that calls
stty and netcat -f and really needs to go away and be replaced by a real
command once toybox has stty.)

(I have an email thread about genericizing the poll() stuff bookmarked.
Need to get back to that. I was once composing a long reply, but balsa
ate it before I switched back to thunderbird which at least saves drafts
if it crashes.)

> Meta-character user-input handling has a similar but unrelated
> variation: Assume the user hits meta (or alt)-C. This can be encoded by
> the terminal as the two bytes ESC,'c', or it can be encoded as the
> single byte 0xe3 (ASCII 'c' | 2**7).

The linux tty layer eats it and sends us something like SIGINT. We don't
have to care.
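
For reference, the two encodings from the quoted text can be sketched like this. decode_meta and its eight_bit flag are hypothetical, and a real parser would have to check for ESC '[' style control sequences before calling it, since ESC followed by a printable character is ambiguous with those:

```c
#include <stddef.h>

/*
 * Decode one meta keystroke from the start of buf.
 * Returns the number of bytes consumed (0 if it isn't a meta key) and
 * stores the base character in *key.
 * eight_bit says whether a lone high-bit byte counts as meta. In a UTF-8
 * stream it shouldn't: 0xe3 is also the lead byte of a 3-byte sequence.
 */
size_t decode_meta(const unsigned char *buf, size_t len, int eight_bit,
                   unsigned char *key)
{
  /* 7-bit form: ESC followed by a printable ASCII character. */
  if (len >= 2 && buf[0] == 0x1b && buf[1] >= 0x20 && buf[1] < 0x7f) {
    *key = buf[1];
    return 2;
  }
  /* 8-bit form: printable ASCII character with the top bit set. */
  if (eight_bit && len >= 1 && buf[0] >= 0xa0 && buf[0] < 0xff) {
    *key = buf[0] & 0x7f;
    return 1;
  }
  return 0;
}
```

Both ESC,'c' and 0xe3 come back as 'c', but only the first is safe to recognize when the input is UTF-8.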

> This detail may follow the
> terminal settings for 8-bit terminal-control characters, or it may not.
> My experience is that it usually does, unless the terminal being
> emulated only had one defined meta-character mode.

Toybox is built around a lot of important simplifying assumptions. We
depend on the system supporting posix-2008, for example.

When it comes to terminal control, some subset of "vt100", "linux", and
"xterm" is probably gonna do us. For output, the ANSI escape sequences
that DOS understood back in the 1980's cover a multitude of sins.
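
As a sketch of what that could look like (helper names are made up, and the exact-match list is just the three types named above, so something like "xterm-256color" would not match): check $TERM against the short list, and emit the classic ANSI sequences for cursor positioning (ESC [ row ; col H) and clearing the screen (ESC [ 2 J):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check: is $TERM one of the types we'd bother with? */
int term_supported(const char *term)
{
  if (!term) return 0;
  return !strcmp(term, "vt100") || !strcmp(term, "linux")
    || !strcmp(term, "xterm");
}

/*
 * Format "move cursor to row,col" (both 1-based) into buf.
 * Returns the length written, or -1 if it didn't fit.
 */
int ansi_goto(char *buf, size_t size, int row, int col)
{
  int n = snprintf(buf, size, "\033[%d;%dH", row, col);

  return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Clear the whole screen: ESC [ 2 J */
const char *ansi_clear(void)
{
  return "\033[2J";
}
```

A caller would do something like term_supported(getenv("TERM")) once at startup and then write the formatted sequences to stdout.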

I just fired up "corbin champion's gnu/root" (which is classic
shareware, it asks if you want to give him money every time it starts
up) on my Android phone, and echo $TERM there said "screen".

"screen" has its own terminal type.

Sigh...

> There are significant issues involved in trying to support 8-bit
> terminal control sequences, 8-bit meta-character sequences and UTF-8.
> Apparently, due to the move to support UTF on the console, the linux
> console driver no longer supports the 8-bit terminal-control sequences.
> man 4 console_codes (or http://linux.die.net/man/4/console_codes)
> documents this in the Bugs section at the end.

I'm pretty happy to follow Linux's lead on this one.

Ooh, that's a nice man page. (Huh, man7.org is down...)

> The modern version of vttest serves as both a good validation tool and a
> good documentation-via-code of both the "expected" behavior and the most
> common variants... it's widely packaged, but the main homepage does have
> additional useful information (and further
> references): http://invisible-island.net/vttest/vttest.html It does
> include keyboard/input tests as well as terminal-control tests, so there
> should be bits of relevance to an editor.

Cool. Thanks.

> Hope this is useful, or at least not annoyingly redundant.

It's helpful.

Rob
