[Toybox] Generic editor. Was: fold implementation

Robert Thompson robertt.thompson at gmail.com
Tue Apr 8 21:46:21 PDT 2014


If this message isn't saving you time researching details, feel free to
skip it. It's just references, details, and background that might be useful
to someone, or might not.

In the context of terminal control, "8-bit encoding" is a bit misleading,
especially if you're googling for information. The 8-bit control sequences
do derive from the pre-UCS ISO/IEC 2022 character-set extension
architecture (the term 'C1 codes' comes from it), but documents about
character-set selection rarely mention terminal control, so most of what
you find under that name will be irrelevant. The more useful search terms
seem to be (8-bit terminal control characters) or (8-bit control
sequences).

Incidentally, the ISO 2022 standard is also how character-set overlaying
was managed in the VT-series terminals. A good example can be found at
https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets (the
rest of the article is less relevant to terminals). There were escape
sequences to designate one of several alternate graphic character sets
into the G0 through G3 slots (ISO 2022 also defines escape sequences for
swapping the C0 and C1 control sets). These might change the
byte-to-character interpretation, as well as change the symbols displayed
for a given codepoint. Hard terminals usually had only a couple of these
built in as a build option, and soft terminals often don't bother
supporting them at all, especially if they're trying for UTF-8
compatibility.

Certain codes in the C0 range (0x00-0x1f) were reserved for specific
control functions (tab, backspace, carriage return, newline, etc). With
only one or two exceptions,
the C1 codes were not reserved by standard in the same way. However, there
were several codes reserved in the C1 range by common usage for
terminal-control purposes.

The xterm control sequence documentation specifies the 8-bit C1 codes as
alternate single-byte codes between 0x84 and 0x9f (within the ISO 2022 C1
control plane) that are equivalent to certain two-byte codes beginning
with ESC. Other sources vary slightly, but this source
(http://rtfm.etla.org/xterm/ctlseq.html) seems to be the closest to a common
superset definition that I've found, with most terminals being more similar
to its documented expectations than they are to each other.
Xterm, dtterm, various VT-series, and several less-common terminals can
emit these. In the case of xterm, 8-bit control sequences (or at least
xterm's emission of such) are controlled by an option which defaults to
generating 7-bit control sequences rather than C1 codes. It is (was?) fairly
common for terminal emulators to accept both 7-bit and 8-bit alternate
encodings, while only emitting one preferred encoding (usually the 7-bit
one).
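
To make the 7-bit/8-bit equivalence concrete, here is a rough Python
sketch (mine, not taken from xterm or any terminal's source). It assumes
the usual rule that a C1 byte equals ESC followed by (byte - 0x40), so
0x9b (CSI) is the same as ESC '['. The function names are made up for
illustration, and the code assumes a pure 8-bit stream, not UTF-8 text.

```python
ESC = 0x1B

def c1_to_7bit(data: bytes) -> bytes:
    """Rewrite each single-byte C1 control (0x80-0x9f) as ESC + (byte - 0x40)."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += bytes([ESC, b - 0x40])
        else:
            out.append(b)
    return bytes(out)

def esc_to_c1(data: bytes) -> bytes:
    """Inverse: collapse ESC followed by a byte in 0x40-0x5f into one C1 byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and 0x40 <= data[i + 1] <= 0x5F:
            out.append(data[i + 1] + 0x40)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

# 8-bit CSI followed by "31m" becomes the familiar ESC [ 3 1 m
print(c1_to_7bit(b"\x9b31m"))
```

A terminal that accepts both forms but emits only one is effectively
running one of these two conversions on its output side.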


Support for 8-bit C1 codes is mostly incompatible with UTF-8, since it is
ambiguous in any given environment whether the terminal stream is supposed
to be interpreted as an 8-bit stream of byte-sequence control codes
interspersed with text interpreted as UTF-8, or a UTF-8 stream of mixed
control codes and text. The first was once considered more logical and was
more common, but since it is incompatible with using UTF-8-aware stream
APIs, it has become rare. The second is merely complex and inefficient, and
not inter-compatible with the first.
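
A small Python sketch of why the two interpretations collide. The helper
names are hypothetical. The first helper scans raw bytes for C1 values,
the second decodes as UTF-8 first and then looks for codepoints U+0080
through U+009F. The sample character U+011B is just one I picked because
its UTF-8 encoding happens to end in the byte 0x9b.

```python
def raw_c1_positions(data: bytes):
    """Interpretation 1: any raw byte in 0x80-0x9f is a C1 control."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

def utf8_c1_positions(data: bytes):
    """Interpretation 2: decode as UTF-8, then look for U+0080..U+009F.

    Returns None when the bytes are not valid UTF-8 at all.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return [i for i, ch in enumerate(text) if 0x80 <= ord(ch) <= 0x9F]

raw_csi = b"\x9b31m"                     # 8-bit CSI, then "31m"
letter = "\u011b".encode("utf-8")        # b'\xc4\x9b', ordinary text
utf8_csi = "\x9b31m".encode("utf-8")     # b'\xc2\x9b31m'

for sample in (raw_csi, letter, utf8_csi):
    print(sample, raw_c1_positions(sample), utf8_c1_positions(sample))
```

The raw 8-bit CSI is not valid UTF-8, the accented letter contains a
byte that a raw scanner would take for CSI, and the UTF-8 form of CSI
costs two bytes, which is as long as the ESC [ form it was meant to
shorten.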

Meta-character user-input handling has a similar but unrelated variation:
Assume the user hits meta (or alt)-C. This can be encoded by the terminal
as the two bytes ESC, 'c', or it can be encoded as the single byte 0xe3
(ASCII 'c' | 2**7). This detail may follow the terminal settings for 8-bit
terminal-control characters, or it may not. My experience is that it
usually does, unless the terminal being emulated only had one defined
meta-character mode.
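
The two meta encodings can be sketched in Python like this. The function
names and the eight_bit flag are invented for illustration; a real
terminal would take this from its own settings, and an editor would
have to know (or guess) which mode is in effect.

```python
ESC = 0x1B

def encode_meta(ch: str, eight_bit: bool) -> bytes:
    """Encode meta+ch either as ESC,ch or as the single byte (ch | 0x80)."""
    code = ord(ch)
    if code >= 0x80:
        raise ValueError("meta encoding is only defined here for 7-bit ASCII")
    return bytes([code | 0x80]) if eight_bit else bytes([ESC, code])

def decode_meta(data: bytes, eight_bit: bool):
    """Return (is_meta, char) for the bytes of a single keypress."""
    if eight_bit and len(data) == 1 and data[0] >= 0x80:
        return True, chr(data[0] & 0x7F)
    if not eight_bit and len(data) == 2 and data[0] == ESC and data[1] < 0x80:
        return True, chr(data[1])
    return False, data.decode("latin-1")

print(encode_meta("c", eight_bit=False))  # two bytes: ESC, 'c'
print(encode_meta("c", eight_bit=True))   # one byte: 0xe3
```

Note that 0xe3 is also the lead byte of a three-byte UTF-8 sequence, so
the single-byte form runs into the same conflict with UTF-8 as the C1
codes do.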

There are significant issues involved in trying to support 8-bit terminal
control sequences, 8-bit meta-character sequences, and UTF-8 all at once.
Apparently, due to the move to support UTF-8 on the console, the Linux
console driver no longer supports the 8-bit terminal-control sequences.
man 4 console_codes (or http://linux.die.net/man/4/console_codes)
documents this in the Bugs section at the end.

The modern version of vttest serves as both a good validation tool and a
good documentation-via-code of both the "expected" behavior and the most
common variants. It's widely packaged, but the main homepage does have
additional useful information (and further references):
http://invisible-island.net/vttest/vttest.html
It includes keyboard/input tests as well as terminal-control tests, so
there should be bits of relevance to an editor.


Hope this is useful, or at least not annoyingly redundant.