[Toybox] [PATCH] vi: Replace linelist with mem_block based design

Fri Jan 17 09:48:11 PST 2020

On Fri, Jan 17, 2020 at 1:10 AM Jarno Mäkipää <jmakip87 at gmail.com> wrote:
>
> Hi
>
> On Fri, Jan 17, 2020 at 8:13 AM enh <enh at google.com> wrote:
> >
> > On Thu, Jan 16, 2020 at 9:59 AM Jarno Mäkipää <jmakip87 at gmail.com> wrote:
> > >
> > > hey
> > >
> > > On Thu, Jan 16, 2020 at 6:28 PM enh <enh at google.com> wrote:
> > > >
> > > > my 2c: i wouldn't do anything this complicated. every editor i've
> > > > written or worked on in the past used the much simpler scheme of one
> > > > buffer with a gap at the most recent insertion point
> > > > [https://en.wikipedia.org/wiki/Gap_buffer]. the most extra complexity
> > > > i've seen this cause is a separate array of line breaks, but since we
> > > > have the busybox vi style of not doing line wrapping, that seems
> > > > unnecessary?
> > >
> > > I actually found this approach much simpler to implement than atleast
> > > the dlist based one I previously had.
> >
> > oh, yeah, i agree that the "list of lines" is the worst option.
> > especially if you want to scale to big files.
>
> Yeah I think we are both in same page that list of lines needs to go.
> Its easy to draw and scroll but its very difficult to do inserts and
> cuts that are not full lines and searching is slow.
>
> >
> > > Difficulties compared to linked
> > > list of lines, is scrolling down or counting line numbers. But
> > > inserting and cutting text is actually much easier. I think Gap_buffer
> > > shares same difficulties without benefit of loading file fast, or
> > > handling big files. I can load 8gb file with this patch in blink of
> > > eye.
> >
> > that's true for gap buffer too. traditionally you just put the gap at
> > the end. (but if there's one thing computers are good at, it's copying
> > data around. and the gap amortizes that anyway, because you only move
> > the "after" part every "gap_length" bytes.)
> >
> > > Reason why I started doing this was that I wanted to do undo, with
> > > infinite history. And could not figure out easy way with linelist. Now
> > > undo history should be really easy to make, by just having separate
> > > slice_list of cut/insert sections in order.
> >
> > this is easy with the gap buffer too. the best trick i learned was
> > from Rob Pike's various editors (you can read the sam
> > [http://doc.cat-v.org/plan_9/4th_edition/papers/sam/] and acme
> > [http://doc.cat-v.org/plan_9/4th_edition/papers/acme/] papers on the
> > web): there's only one edit, and that's replace-range(start_offset,
> > end_offset, bytes). insert just has a start == end, delete has no
> > bytes, and replace is replace. and there's a trivial transformation
> > from each operation to its "undo". and you can put the undone edits on
> > the redo stack too if you want.
>
> I looked into Gap buffer and Piece table more now. I think what I did
> was Piece table without knowing it. My piece table is slice_list and i
> just combined source and start to be just one pointer, to origin/add,
> and I could have multiple origin files mmap() ed. (ex command :r could
> just mmap new region extending add list)
> https://darrenburns.net/posts/piece-table/
>
> I think both has ups and downs, piece table is better for large files,
> and also for doing repeated inserts in random parts of file, such as
> regexp find and replace.
>
> Gap buffer is better for doing lots of inserts around same area.

also known as "typing", the common case :-)

> Also
> searching is easier since it has 2 continuous regions.

historically i've actually just cheated and moved the gap to the end
of the file. that lets you just use existing regex implementations
without having to try to get them to work across gaps. afaik, a piece
table will require a regular expression API that takes something like
a Java CharSequence where you can hide that the bytes aren't actually
consecutive. which would be fine if you weren't trying to use C.

> Also data does
> not need to be copied as much for drawing, only around gap. Good for
> small files. Cant really open large files since it needs heap of
> filesize+gap or perhaps as read-only gap can be zero and memory be
> mmap() into file.

gap is usually small. humans can't type fast, and computers can copy
large blocks of memory *very* fast, so even 128 bytes or whatever
amortizes fine. (you can special case "paste" so it doesn't get
pathological on large pastes.)

> both suck at jumping into line number

but piece table sucks a lot harder, because you have to effectively
apply all the edits to work out where you are.

> Ideally commands such as less should share vi code just editing
> commands disabled, so buffer opening, drawing, searching, cursoring
> and jumping should be same.
>
> Right now I only have 2 functions for insert and cut, and i think its
> all needed.

like i said (you should read those papers), you actually only need
"replace". doesn't make much difference yet (and i've certainly had
one-line helpers to translate insert and cut in replace historically),
but only having one primitive edit makes undo/redo easier when you get
to them.

> int insert_str(const char *data, size_t offset, size_t size, size_t len,
>   enum alloc_flag type)
> int cut_str(size_t offset, size_t len)
>
> and few commands for searching line breaks, copying current line,
> checking that current codepoint is valid for cursoring...
> size_t text_strchr(size_t offset, char c)
> size_t text_strrchr(size_t offset, char c)
> size_t text_filesize()
> char text_byte(size_t offset)
> int text_codepoint(char *dest, size_t offset)
> size_t text_getline(char *dest, size_t offset, size_t max_len)
>
> So I think now that I added extra layer we can still later switch into
> gap_buffer, if we dont touch directly into buffer handing in actual vi
> commands? (and some of this abstraction can be removed when cleaning
> up codebase after we are satisfied what type of buffer we should use)

personally i'd go the other way. gap buffer is a lot easier, and you
can always switch to a piece table if/when it proves necessary. if we
were aiming for syntax coloring, say, that would be a mark in favor of
the piece table. but aiui we're not [and i've seen a separate table of
style runs work fine/better for that in the past anyway]. like i said,
it would be good to explicitly agree the goals/non-goals with rob.

> > > > i do know that Visual Studio Code started with an array of lines and
> > > > switched to a piece table (common in word processors)
> > > > [https://en.wikipedia.org/wiki/Piece_table], but then you end up
> > > > needing a cache to make that fast, and...
> > > > https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation
> > >
> > > Actually I think this method i did is Piece table? I didint know what
> > > to call it.
> > >
> > > >
> > > > On Thu, Jan 16, 2020 at 8:17 AM Rob Landley <rob at landley.net> wrote:
> > > > >
> > > > > On 1/16/20 7:58 AM, Jarno Mäkipää wrote:
> > > > > > Replaced dlist linelist with continuous memory blocks. This will allow
> > > > > > editing huge files without billion mallocs.
> > > > >
> > > > > More or less what I was trying to implement back when vi was a thing I was going
> > > > > to work on.
> > > > >
> > > > > > File is first opened with
> > > > > > mmap() and mapped region is fully described in block_list as one block.
> > > > >
> > > > > When I tried this it turned into a complex mess of different objects with
> > > > > different allocation/lifetime rules that had to interact. I intended to cap the
> > > > > block size because splitting becomes painful otherwise (and then wound up having
> > > > > strings individually allocated with the array of pointers to the start of
> > > > > strings being a seperate allocation), and of course mmap() requiring the offset
> > > > > to be a multiple of page size combined badly with line starts being all over the
> > > > > place...
> > > > >
> > > > > (I never did figure out what to do about a file that changed out from under you
> > > > > while you were editing it. The MAP_SHARED version has the changes invalidate
> > > > > your indexes and segfault, especially if it's truncated out from under you. The
> > > > > MAP_PRIVATE version doesn't work on nommu, I believe still has the truncate
> > > > > problem, and doesn't help you with parts it hasn't faulted in until after the
> > > > > file changed.)
> >
> > yeah, as far as i know there's no fix to that. the only 8GiB i can
> > imagine anyone loading into vi would be a log, and logs get truncated.
> >
> > is editing 8GiB text files on machines with <8GiB a goal or a
> > non-goal? (i know hexedit supports this, but i think it seems more
> > likely there: editing a disk image on a machine, say, because most
> > devices of all scales have a lot more disk than ram. but a text
> > file...?)
>
> When I originally talked to Rob around year ago he wanted to support
> big files. And this would be useful if vi would share codebase
> commands like less.
>
> >
> > > I had idea that I could just heap allocate and load file with read()
> > > as fallback or if file is small. Implementation does not care what
> > > type of memory original block is. Only unload on closing needs to call
> > > either free() or munmap()
> > >
> > > > >
> > > > > > Currently "valid" data is described as slices, when first loading file
> > > > > > there is only one slice that points to memory existing in block_list.
> > > > > > When cutting text, block_list is not freed or modified, but instead
> > > > > > slice_list is modified to have "hole" between 2 slices. when inserting
> > > > > > new mem_block is added, previos slices are cut in cursor position and
> > > > > > new slice is added...
> > > > >
> > > > > *shrug* I leave this to you.
> > > > >
> > > > > > Implementation is not as finished as old linelist, but most of the heavy
> > > > > > lifting is done so wanted to share this at this state.
> > > > > >
> > > > > > Added functions to handling data inside block_list+slice_list
> > > > > >
> > > > > > insert_str(), cut_str() are used for all delete and add operations
> > > > > > text_strrchr(), text_strchr() are used for searching lineendings
> > > > > > text_byte(), text_codepoint(), text_getline() are for simple data access
> > > > > >
> > > > > > Implemented: yank,delete,insert,push and most of the previos moves
> > > > > >
> > > > > > Implemented partly: saving of file, since current file is still mmaped
> > > > > > cant save on top of it,
> > > > >
> > > > > You never want to do that anyway because a crash halfway through (battery dies,
> > > > > etc) leaves you a corrupted file.
> > > > >
> > > > > > so now saves into "filename".swp, later need
> > > > > > to move .swp into original file after munmap
> > > > >
> > > > > rename() is atomic.
> > >
> > > Yeah i was thinking of saving temp file and then unloading mmap and
> > > original file (and rest of my allocated blocks) and then rename() and
> > > for extra cheese maybe file permissions and such need to be copied
> > > from original file...
> >
> > i think vim does the opposite --- it renames the existing file, then
> > writes to the old file name, then unlinks the renamed original if
> > everything goes okay.
> >
> > > > > > Not Implemented: search, linenumber counting, etc..
> > > > > >
> > > > > > Removed: Some unused functions
> > > > > >
> > > > > > br
> > > > > >
> > > > > > -Jarno
> > > > >
> > > > > Do I apply this or not?
> > > > >
> > > > > Rob
> > >
> > > Its your call, this still needs more work, but I think its better
> > > design than dlist one.
> > >
> > > -Jarno
> > >
> > > > > _______________________________________________
> > > > > Toybox mailing list
> > > > > Toybox at lists.landley.net
> > > > > http://lists.landley.net/listinfo.cgi/toybox-landley.net