[Toybox] Generic editor. Was: fold implementation
Robert Thompson
robertt.thompson at gmail.com
Sun Apr 13 15:20:13 PDT 2014
I'm having to fight gmail-web's reformatting more than normal today...
On Sat, Apr 12, 2014 at 9:00 PM, Rob Landley <rob at landley.net> wrote:
> On 04/08/14 23:46, Robert Thompson wrote:
> > In the context of terminal control, "8-bit encoding" is a bit
> > misleading, especially if you're googling for information. The pre-UCS
> > ISO/IEC 2022 character-set extension architecture is actually related,
> > but most of the information you find that way will be irrelevant. In the
> > context of terminal control, the relevant search term seems to be (8-bit
> > terminal control characters) or (8-bit control sequences). The 8-bit
> > control sequences are actually related to ISO 2022 (and the term 'C1
> > codes' derives from this) ... but the documents talking about
> > character-set selection rarely mention anything about terminal control.
>
> I noticed that.
>
> I'm strongly tempted to just treat both escape sequences as "we saw an
> escape sequence" and then the things after that translate the same
> whether they were after the two char or the one char version.
>
> Except that the 8 bit one doesn't seem compatible with UTF-8, and I want
> the code to deal with UTF-8 properly.
>
rxvt-unicode and Apple's Terminal.app are both pretty comprehensively UTF-8
compatible. I can't speak to their sanity in any other area, but both of
them handle UTF-8 display strings quite well. They might make good examples
to follow for minimum UTF-8 breakage.
A detail that might save time for anyone looking into Terminal.app... its
emulation model follows dtterm (and thus the ANSI X3.64-1979 and ISO
6429:1992(E) standards) more than it follows xterm or other emulators...
even though its default advertised TERM value is xterm-color.
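For what it's worth, the "treat both forms the same" idea is cheap to sketch for a non-UTF-8 byte stream: each 8-bit C1 control (0x80 through 0x9f) is equivalent to ESC followed by that byte minus 0x40, so 0x9b (CSI) becomes ESC [. The function names below are made up for illustration. The sketch assumes the input is raw bytes and is NOT being decoded as UTF-8, where those same byte values are continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: if c is an 8-bit C1 control (0x80-0x9f), return the
 * second byte of the equivalent two-byte ESC sequence; otherwise return 0.
 * Only meaningful when the stream is not being decoded as UTF-8. */
int c1_to_esc_final(unsigned char c)
{
  if (c < 0x80 || c > 0x9f) return 0;
  return c - 0x40;
}

/* Copy src to dst, expanding each C1 byte into its ESC-prefixed 7-bit form.
 * Returns the number of bytes written, or -1 if dst is too small. */
long expand_c1(const unsigned char *src, size_t len,
               unsigned char *dst, size_t dstlen)
{
  size_t i, out = 0;

  for (i = 0; i < len; i++) {
    int final = c1_to_esc_final(src[i]);

    if (out + (final ? 2 : 1) > dstlen) return -1;
    if (final) {
      dst[out++] = 0x1b;
      dst[out++] = (unsigned char)final;
    } else dst[out++] = src[i];
  }
  return (long)out;
}
```

After a pass like this, the rest of the parser only ever has to recognize the ESC-prefixed form.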
>
> > Incidentally, the ISO 2022 standard is also how character-set overlaying
> > was managed in the VT-series terminals. A good example can be found
> > at
> https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets
> > (the rest of the article is less relevant to terminals). There were
> > escape sequences to load one of several alternate C0 sets and escape
> > sequences to load one of several alternate C1 sets. These might change
> > the byte-to-character interpretation, as well as change the symbols
> > displayed for a given codepoint. Hard terminals only usually had a
> > couple of these built in as a build-option, and soft-terminals often
> > don't bother supporting them at all, especially if they're trying for
> > UTF-8 compatibility.
>
> We're trying for UTF-8 compatibility, and I don't think we'd want the
> complexity of implementing state shifting anyway.
> > Certain codes in the C0 range were reserved for control codes (tab,
> > backspace, carriage return, newline, etc). With only one or two
> > exceptions, the C1 codes were not reserved by standard in the same way.
> > However, there were several codes reserved in the C1 range by common
> > usage for terminal-control purposes.
> >
> > The xterm control sequence documentation specifies the 8-bit C1 codes as
>
> Ooh! "xterm control sequence documentation", a google search phrase:
>
> http://invisible-island.net/xterm/ctlseqs/ctlseqs.html
>
> (Helps to know what you're looking for...)
>
> Also turned up:
>
> http://www.x.org/archive/X11R6.8.1/doc/xterm.1.html
>
> Which is less useful. (Tektronix 4014? Double sized characters? Nope.)
>
> > alternate single-byte codes between 0x84 and 0x9f (within the ISO 2022
> > C1 control plane) that are equivalent to certain two-byte codes
> > beginning with ESC.
>
> Once again, how do they interact with unicode? :)
>
> There are weird unicode keyboard compositing whatsises that say "put a
> diacritical mark over that character you just typed" or "the next
> character will be wearing carmina miranda burana's fruit hat" or stuff.
> I have no _clue_ about this, but will probably need to learn it.
>
Apple apparently spent serious effort getting this to work... If I use the
OSX keyboard composition in a Terminal.app window, I get unicode
characters, and unicode-aware terminal apps receive them undamaged. I have
no clue what they did either, but an example is half the battle ;)
Terminal.app seems to not support meta-keys by default (it suggests in the
help that this is mainly for supporting x11 and "some editors", such as
emacs). You can always fake it by hitting escape,f (when your editor or
whatever expects alt-f), but you have to do it quickly, since the escape
sequence timeout is so short in the modern non-low-baud world. I suspect
that any terminal that supports UTF-8 never sends the 8-bit form of meta
keys.
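The escape-then-key timing issue can be sketched with poll(): after reading an ESC, wait briefly for a following byte, and treat the ESC as a standalone key if nothing arrives. This is only a sketch under assumptions: the function name, the KEY_META flag, and the choice of timeout are mine, and it ignores longer sequences such as ESC [ A, which a real reader would have to check for first.

```c
#include <poll.h>
#include <unistd.h>

#define KEY_META 0x100  /* hypothetical flag: key arrived ESC-prefixed */

/* Read one key from fd. A lone ESC is returned as 0x1b. ESC followed by
 * another byte within timeout_ms is returned as that byte | KEY_META.
 * Returns -1 on EOF or error. */
int read_key(int fd, int timeout_ms)
{
  unsigned char c;
  struct pollfd pfd;

  if (read(fd, &c, 1) != 1) return -1;
  if (c != 0x1b) return c;

  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  /* Nothing arrived in time: the ESC was a key press of its own. */
  if (poll(&pfd, 1, timeout_ms) <= 0) return 0x1b;
  if (!(pfd.revents & POLLIN)) return 0x1b;
  if (read(fd, &c, 1) != 1) return 0x1b;

  return c | KEY_META;
}
```

A short timeout keeps a bare ESC responsive, at the cost of requiring the user to type escape,f quickly.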
I am unsure how they handle UTF-8-invalid 8-bit values within other escape
codes. xterm mouse-position escape codes on terminals with width or
height greater than 95 will contain byte values of 128 or more, since each
position is encoded as the coordinate plus 32.
The invisible island xterm control sequence document includes the
mouse-related codes, it turns out.
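To make the offset arithmetic concrete, here is a small sketch of decoding the three bytes that follow ESC [ M in the classic xterm mouse report. The struct and function names are hypothetical, and the "code" field is left raw (button and modifier bits are not split out).

```c
/* Decoded xterm-style mouse report (names are illustrative only). */
struct mouse_report {
  int code;  /* raw button/modifier code */
  int col;   /* column, counted from 1 */
  int row;   /* row, counted from 1 */
};

/* Decode the three bytes after ESC [ M. Each byte is the value plus 32.
 * Returns 0 on success, -1 if any byte is below the offset. */
int decode_mouse(const unsigned char *b, struct mouse_report *m)
{
  if (b[0] < 32 || b[1] < 32 || b[2] < 32) return -1;
  m->code = b[0] - 32;
  m->col = b[1] - 32;
  m->row = b[2] - 32;
  return 0;
}

/* Nonzero if a coordinate cannot be sent as a 7-bit byte, which is where
 * the clash with UTF-8 decoding starts. */
int mouse_coord_needs_8bit(int coord)
{
  return coord + 32 > 127;
}
```

Column 95 encodes as byte 127 and still fits in 7 bits; column 96 encodes as byte 128 and does not.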
>
> > Other sources vary slightly, but this source
> > (http://rtfm.etla.org/xterm/ctlseq.html) seems to be the closest to a
> > common superset definition that I've found, with most terminals being
> > more similar to its documented expectations than they are similar to
> > each other.
>
> I think that's the same as the invisible island one google found.
>
Now that you mention it, the invisible island site is FULL of various
collected terminal-related things. Much of it is mirrored from elsewhere,
but that site has a really nice ore-to-dross ratio. Some of it only shows
up with a google site: search though.
>
> > Xterm, dtterm, various VT-series, and several less-common terminals can
> > emit these. In the case of xterm, 8-bit control sequences (or at least
> > xterm's emission of such) is controlled by an option which defaults to
> > generating 7-bit control sequences rather than C1 codes. It is(was?)
> > fairly common for terminal emulators to accept both 7-bit and 8-bit
> > alternate encodings, while only emitting one preferred encoding (usually
> > the 7-bit one).
>
> Gnome has a terminal. XFCE has a terminal. KDE has Konsole.
>
> The thing is, they're all pieces of software. We have a piece of
> software at one end talking to another piece of software at another end,
> and mostly they're filtering through curses, which replaced terminfo,
> which replaced termcap. The actual hardware terminals that sent and
> required specific sequences went away 30 years ago, this entire _layer_
> exists because people can't agree on standard escapes and keep trying to
> humor obscure pieces of software. I'm leaning pretty strongly towards
> "If you can't be bothered to set $TERM to something sane, you get to
> keep the pieces", and letting software that can't cope with that just
> die off.
> In theory, the $TERM environment variable tells you what to look for. By
> default on linux it should be set to "linux". The current value on my
> system appears to be "xterm".
> $ grep -lr '"TERM"' toys
> toys/other/login.c
> toys/lsb/su.c
> toys/pending/getty.c
> toys/pending/init.c
> toys/pending/telnet.c
>
> Let's see, in login.c:
>
> if (clear_env) {
> const char * term = getenv("TERM");
> clearenv();
> if (term) setenv("TERM", term, 1);
> }
>
> So a segfault if it's not already set. (Honestly, I should just take the
> commands from my "to review" list predating the pending directory and
> move them to pending. They need the same going-over...)
>
> toys/lsb/su.c:
>
> That does a get_env() that returns NULL if it's not set, and blindly
> does putenv() on that. And the man page does not require putenv to
> accept a NULL without segfaulting or corrupting the environment space.
> (Ok, that one's my fault since I already did a cleanup pass. In 5 minute
> chunks scattered over several days, but still.)
>
> toys/pending/getty.c: sets it from command line, doesn't really care
> what it is.
>
> toys/pending/init.c: sets it to "linux" unconditionally... and then
> calls getty with the argument "vt100". What the...? (There's a reason
> it's still in pending...)
>
> toys/pending/telnet.c: reads it, sets it to "" if blank.
>
> So that's examples of "linux", "xterm", and "vt100" in the wild. And I'm
> not too sure about that last one.
>
> Meanwhile, "echo /lib/terminfo/*/* | wc" is finding 39 terminal types
> installed by default on my ubuntu box. Um...
>
> echo /lib/terminfo/*/* | sed 's@[^ ]*/@@g'
> ansi cons25 cons25-debian cygwin dumb Eterm Eterm-color hurd linux mach
> mach-bold mach-color pcansi rxvt rxvt-basic rxvt-m rxvt-unicode screen
> screen-256color screen-256color-bce screen-bce screen-s screen-w sun
> vt100 vt102 vt220 vt52 wsvt25 wsvt25m xterm xterm-256color xterm-color
> xterm-debian xterm-mono xterm-r5 xterm-r6 xterm-vt220 xterm-xfree86
>
> Pretty sure the different xterm variants can go away, cygwin/hurd/mach
> are useless, sun is toast... rxvt needs its own type? Really? What's an
> Eterm? (Eterm-color is a symlink anyway, as is rxvt-m.)
>
Eterm is the Enlightenment terminal. A quick inspection of the escape codes
suggests that it's another VT100/VT220 extended to support variable-size
emulated terminals, and with awareness of many xterm and linux escape
codes. It would probably respond acceptably if TERM were set to linux or
xterm-color.
xterm, xterm-color, xterm-256color are different; xterm doesn't advertise
color support, xterm-256color supports the extended 256-color escape
sequences, and xterm-color terminfos usually don't list the extended escape
sequences (in case your terminal emulator can't actually support them).
Also, it's not uncommon for the xterm terminfo to be limited to a minimal
backwards-compatible monochrome version that will work with a wide range of
xterms. It won't work really well for rxvt, vt100, or (for some things)
linux-console cases.
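The practical difference between the basic and 256-color cases is which SGR sequence gets emitted. A minimal sketch, assuming the caller already knows how many colors the terminal supports (the function name and the ncolors parameter are mine, not anything from terminfo or toybox):

```c
#include <stdio.h>

/* Write the SGR sequence selecting foreground color `color` into buf.
 * ncolors is how many colors the terminal is believed to support
 * (0 for monochrome). Returns the sequence length, or 0 if the color
 * can't be expressed for that terminal. */
int fg_color_seq(char *buf, size_t len, int ncolors, int color)
{
  if (color < 0 || color >= ncolors) return 0;

  /* The basic eight colors use the short ESC [ 30-37 m form. */
  if (color < 8) return snprintf(buf, len, "\033[%dm", 30 + color);

  /* The extended palette uses ESC [ 38 ; 5 ; N m. */
  if (ncolors >= 256) return snprintf(buf, len, "\033[38;5;%dm", color);

  return 0;
}
```

With ncolors set to 0 (the monochrome xterm case) nothing is ever emitted, which is the safe fallback.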
Screen is interesting, in that it's a vt100 close-derivative that was
consciously chosen to be the subset of vt100 features (automatic margins,
etc) that were easily translated to whatever random terminal the user is
using (via termcap code lookups). As such, it seems to be a fairly neutral
subset that works fairly well in common across most of the vt-derived
terminals, including xterm and linux-console.
Most of the other xterm variants (and all non-vt100 "emulated real
terminal" types) are probably out of scope for projects like toybox and
aboriginal.
>
> My rule of thumb here is if a terminal program doesn't work at the
> command prompt, vi doesn't need to care either...
>
> > Support for 8-bit C1 codes is mostly incompatible with UTF-8, since it
> > is ambiguous in any given environment whether the terminal stream is
> > supposed to be interpreted as an 8-bit stream of byte-sequence control
> > codes interspersed with text interpreted as UTF-8, or a UTF-8 stream of
> > mixed control-codes and text. The first was once considered more logical
> > and was more common, but since it is incompatible with using UTF-aware
> > stream apis, it has become rare. The second is merely complex and
> > inefficient, and not inter-compatible with the first.
>
> Yeah, I suspected as much.
>
> We're not required to work with every insane terminal type out there. We
> can _supply_ a terminal emulator that's not crazy. (Busybox has minicom,
> toybox has a sad little scripts/minicom.sh wrapper script that calls
> stty and netcat -f and really needs to go away and be replaced by a real
> command once toybox has stty.)
>
Picocom is pretty cool for that; it handles the line-discipline and
character-device io, and lets whatever console or emulator that the user is
using for cli access handle the escape codes it receives. It's also quite a
small codebase. If necessary, picocom plus tmux gets you back to the
screen-as-serial-console capability that tmux itself chose not to implement.
https://code.google.com/p/picocom/ It's gplv2...
It also has a minimum-dependency terminal-io handler that might be useful
to look at.
>
> (I have an email thread about genericizing the poll() stuff bookmarked.
> Need to get back to that. I was once composing a long reply, but balsa
> ate it before I switched back to thunderbird which at least saves drafts
> if it crashes.)
>
> > Meta-character user-input handling has a similar but unrelated
> > variation: Assume the user hits meta(or alt)-C. This can be encoded by
> > the terminal as the two bytes ESC,'c' , or it can be encoded as the
> > single byte 0xe3 ( ascii 'c' | 2**7 ).
>
> The linux tty layer eats it and sends us something like SIGINT. We don't
> have to care.
>
That would be true for ctrl-c, but this is meta-c. I probably shouldn't
have used an example so similar to one of the low-level terminal-control
keys. Of course, until someone actually needs to accept meta-keys, it
isn't a problem :) It's only worth mentioning to help avoid
over-simplifying our assumptions in ways that make meta-keys difficult to
support later. However, emacs(ish) editors often depend on meta keys.
Fortunately, a quick source skim suggests that the kernel.org microemacs is
hardwired to 7-bit (escape-prefixed) meta-keys, and doesn't even understand
the 8-bit metakey sequences.
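For reference, the two meta encodings can be decoded with very little code. This is a sketch with a hypothetical function name; the allow_8bit switch is my assumption about how a program might opt out of the 8-bit form when the stream is UTF-8, and the sketch ignores the overlap between ESC-prefixed meta keys and other escape sequences.

```c
/* Decode a meta-key press at the start of buf (n bytes available).
 * The 7-bit form (ESC followed by the key) is always accepted. The 8-bit
 * form (key | 0x80) is accepted only when allow_8bit is set, because in a
 * UTF-8 stream bytes >= 0x80 belong to multibyte characters.
 * Stores the base key in *key and returns the number of bytes consumed,
 * or 0 if buf does not start with a meta key. */
int decode_meta(const unsigned char *buf, int n, int allow_8bit, int *key)
{
  if (n >= 2 && buf[0] == 0x1b && buf[1] < 0x80) {
    *key = buf[1];
    return 2;
  }
  if (allow_8bit && n >= 1 && (buf[0] & 0x80)) {
    *key = buf[0] & 0x7f;
    return 1;
  }
  return 0;
}
```

Both ESC,'c' and the single byte 0xe3 decode to the base key 'c'; with allow_8bit off, 0xe3 is left alone for the UTF-8 decoder.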
>
> > This detail may follow the
> > terminal settings for 8-bit terminal-control characters, or it may not.
> > My experience is that it usually does, unless the terminal being
> > emulated only had one defined meta-character mode.
>
> Toybox is built around a lot of important simplifying assumptions. We
> depend on the system supporting posix-2008, for example.
>
> When it comes to terminal control, some subset of "vt100", "linux", and
> "xterm" is probably gonna do us. For output, the ANSI escape sequences
> that DOS understood back in the 1980's cover a multitude of sins.
>
> I just fired up "corbin champion's gnu/root" (which is classic
> shareware, it asks if you want to give him money every time it starts
> up) on my Android phone, and echo $TERM there said "screen".
>
> "screen" has its own terminal type.
>
> Sigh...
>
When you say 'some subset of "vt100", "linux", and "xterm"', what you're in
effect saying is almost exactly what the "screen" terminal-type implements
:) I suspect that there are a few minor places where variance or extension
might be good. (I don't know how screen handles mouse-related codes, for
example; the xterm mouse codes are fairly widespread, no matter what the
advertised terminal type is, so we might want to handle them even if we
just consume and discard them.)
Screen (and tmux) both use a near-emulation of VT100 as the basis for the
"screen" terminal type.
https://www.gnu.org/software/screen/manual/screen.html#Virtual-Terminal covers
both the terminal expectations/emulation, and (one section further down)
the vt100 keystrings it stuffs when it detects the various
termcap-specified keys (the "outside" terminal can change across
screen-disconnect/screen-reconnect events, and screen wants to send
consistent key strings, and accept consistent escape codes)
I just tested setting TERM=screen in an xterm and an OSX Terminal.app
window. Plain out-of-box vim remained functional (without any special
configs or styles). :syntax on did correctly display colors, and both
linescroll and pagescroll worked correctly. I seem to remember that this
also mostly worked on the linux console when I tested it a couple of years
ago.
Setting terminal to "vt100", "linux", or "xterm" ("xterm-color" where
necessary) had slightly less-positive results when used cross-terminal.
The things that behave strangely are mostly things that look for certain
$TERM values and enable/disable features based on recognizing certain
terminal names. However, this is less relevant for toybox and aboriginal.
I haven't done exhaustive testing on any of this, and haven't tested
unusual keysyms at all.
If we're lucky, simply hardwiring the terminal and key codes that overlap
between the "screen" terminal, the linux console, and the documented xterm
codes should give us a subset that is very widely supported even on other
VT100-derived terminal emulators. If that doesn't work, perhaps a very
simple implementation of the low-level terminfo primitives (but maybe not
the terminfo *system*) would give us maximum benefit for minimum effort.
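A hardwired table of that kind might look like the sketch below. The sequences listed are ones I believe are shared by xterm (in normal cursor-key mode), the linux console, and screen, but I have not checked them against each terminfo entry, so treat the table contents as an assumption. Home and End are left out because they vary between terminals. All names are hypothetical.

```c
#include <string.h>

enum key { K_NONE, K_UP, K_DOWN, K_RIGHT, K_LEFT,
           K_INSERT, K_DELETE, K_PGUP, K_PGDN };

/* Assumed-common input sequences; not verified against every terminal. */
static const struct { const char *seq; enum key key; } key_table[] = {
  {"\033[A", K_UP},      {"\033[B", K_DOWN},
  {"\033[C", K_RIGHT},   {"\033[D", K_LEFT},
  {"\033[2~", K_INSERT}, {"\033[3~", K_DELETE},
  {"\033[5~", K_PGUP},   {"\033[6~", K_PGDN},
};

/* Match the start of buf (n bytes) against the table. On a match, store
 * the sequence length in *used and return the key; otherwise store 0 and
 * return K_NONE. */
enum key lookup_key(const char *buf, size_t n, size_t *used)
{
  size_t i;

  for (i = 0; i < sizeof(key_table) / sizeof(*key_table); i++) {
    size_t len = strlen(key_table[i].seq);

    if (n >= len && !memcmp(buf, key_table[i].seq, len)) {
      *used = len;
      return key_table[i].key;
    }
  }
  *used = 0;
  return K_NONE;
}
```

Anything that doesn't match falls through as K_NONE, so unknown sequences can be consumed and discarded rather than inserted as text.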
>
> > There are significant issues involved in trying to support 8-bit
> > terminal control sequences, 8-bit meta-character sequences and UTF-8.
> > Apparently, due to the move to support UTF on the console, the linux
> > console driver no longer supports the 8-bit terminal-control sequences.
> > man 4 console_codes (or http://linux.die.net/man/4/console_codes)
> > documents this in the Bugs section at the end.
>
> I'm pretty happy to follow Linux's lead on this one.
>
> Ooh, that's a nice man page. (Huh, man7.org is down...)
>
> > The modern version of vttest serves as both a good validation tool and a
> > good documentation-via-code of both the "expected" behavior and the most
> > common variants... it's widely packaged, but the main homepage does have
> > additional useful information (and further
> > references): http://invisible-island.net/vttest/vttest.html It does
> > include keyboard/input tests as well as terminal-control tests, so there
> > should be bits of relevance to an editor.
>
> Cool. Thanks.
>
> > Hope this is useful, or at least not annoyingly redundant.
>
> It's helpful.
>
> Rob
>