[Toybox] New diff.c

Rob Landley rob at landley.net
Mon Aug 8 01:14:29 PDT 2022


After wrestling with diff on and off for weeks (I can see WHAT it's doing but
not WHY) I broke down and wrote a new one (attached, very preliminary). Dunno
yet if I'll use it instead of the old one, but I understand how this one _works_
which still isn't true for the conventional algorithms despite much squinting
and reading old papers and people's blog posts about them.

This diff is streaming, like my patch implementation: it accumulates changed
lines for the current hunk and frees them again when it's done with them. Thus
its memory and CPU consumption scale with the size of the _hunk_, not the size
of the file. Between hunks it's only retaining as much trailing context as it
needs to start a new hunk if the next line differs. (Sometimes it's working
through a backlog of lines it preloaded looking for the end of the previous
hunk, but it's freeing lines as it goes between hunks when that's the case,
working DOWN from a memory consumption high water mark.)

The overall flow control is diff_main() calls find_hunks() to find hunks, which
calls dump_hunk() to display each hunk as a series of +, -, and context lines.
All line comparisons are done by diffcmp() which takes all the flags into
account (whitespace, case sensitivity, stripping newlines). Lines are read by
get_lines() and freed by trim_lines().

struct half_hunk contains all the info we currently know about one of the input
files: its name, its FILE *, the array of lines we've read from it so far but
not freed yet, what line number we're up to, etc. There are generally two
instances of this: h1 is the minus half and h2 is the plus half.

Details: diff_main() currently opens two input files (using hunky("filname"))
and calls find_hunks() on them. It's a stub: I still need to add the directory
traversal logic back, it's not setting a return code...

find_hunks() traverses two input files to find differences. Between hunks it
calls get_line() on each half_hunk to append one line (or detect EOF). If the
new lines match according to diffcmp() it keeps the last -U many context lines
(to start a new hunk with as necessary) and frees the rest via trim_hunk(). If
both files end at the same time during this mode, it exits the loop (which will
free both half_hunk structures and their contents).

If the search loop in find_hunks() doesn't get a pair of matching lines
(possibly because one file ended before the other) it alternates loading one
line from each input (that hasn't ended yet) and calling back_search() to see if
this new line ends the hunk. (If so, the shorter half-hunk will unget lines so
it ends in the appropriate spot, and future get_line() calls will add them back
as necessary.) If both inputs end, that also ends an in-progress hunk.

back_search() looks for 2 times -U many matching lines from one half_hunk
somewhere in the other because that's what it takes to break a unified diff
hunk. Since hunks occur in order and can't overlap, that means you can't END a
hunk until you can start the next one, or have hit the end of the file. Yes this
means with -U 3 you can have up to 5 consecutive matching lines in the middle of
a hunk, try it:

$ diff -u <(echo axbxcxdxexfxgxhxi | tr x '\n') <(echo axBxcxdxexfxgxHxi | tr x
'\n')
--- /dev/fd/63	2022-08-08 01:35:42.463922199 -0500
+++ /dev/fd/62	2022-08-08 01:35:42.463922199 -0500
@@ -1,9 +1,9 @@
 a
-b
+B
 c
 d
 e
 f
 g
-h
+H
 i

find_hunks() alternates adding a new line to the end of each half_hunk and
calling back_search on the other half_hunk because we don't know ahead of time
whether we're net adding or removing lines, and thus which half_hunk will be
shorter. This is why we need the unget plumbing using the "saved" member of
struct half_hunk, because we're likely to overshoot on one of them and must save
those extra lines after this hunk until they're needed later.

Once find_hunks() has delineated a hunk it calls dump_hunk() to display the hunk
as +, -, and context lines. Since lines are displayed in order and context lines
match in both hunks, this means identifying context lines. This part sucks and
is darn near AI complete to do properly: what I did was:

1) glue both half hunks together and qsort() them, identifying and tagging
matches. (And yes "match" requires a matching line in the OTHER half_hunk, not
just repeated in this half_hunk.)

2) Advance forward in both half_hunks simultaneously looking for the next marked
line in each. Marked lines MAY be matched as context lines, but it doesn't mean
they will. Once we have the next pair or marked lines, simultaneously search
forward checking the OTHER half_hunk for a marked line that diffcmp() says
matches this one. This means we keep the closer match in either direction. If we
hit the end of both half_hunks before finding a match for either line, we unmark
them both as matched lines and start over. If we hit the end of both files, then
set "next context line" to the length of the hunk.

3) Emit - lines from h1 up to the next context line. Emit + lines from h2 up to
the next context line. If we're not at the end of the hunk, emit the next
context line from h1 and skip it in both h1 and h2. Loop back to the start of
(2) above.

Yes I can come up with a pathological input that breaks this. Probably the "if
we hit the end of both half_hunks" above should be cleverer, hitting the end of
just ONE half_hunk means skip that marked line and start its search over,
traversing as far as the other side's searched before going back to alternating...

The trick to making dump_hunk() work is an extra layer of indirection. Each
half_hunk stores its lines as an array of char pointers using the members char
**str and unsigned len. dump_hunk() collates both half_hunks into char ***big
which is an array of pointers to each line in that hunk's str array, and then
calls qsort() on that to find duplicate pairs. The extra layer of indirection
lets us:

A) figure out which hunk the string came from after the fact (is it >= h1->str
and <= h2->str+h1->len) so matches only count if it's got an entry in both
half_hunks.

B) Mark the matches by modifying each string pointer in the half_hunk that has a
match, by setting its bottom bit which is otherwise guaranteed to be zero by
malloc() returning word-aligned memory. We do this by dereferencing the char ***
enough times to incrementing the relevant char * pointer by one.)

The display loop at the end calls a squish() function that unmarks the line
(checks the bottom bit and decrements the pointer if it's set). We need to do
this before we can free() the line.

Anyway, the attached is still buggy and incomplete, but it gives the general
idea. If this ACTUALLY pending code this would be at least my third checkin of
the new plumbing, but since people are building and using stuff out of pending I
can't replace it until the new stuff fully works, which means you don't get to
see all the development history. So here's a checkpoint since I've been kinda
quiet otherwise. :)

I need to add the above 5 context line test to the test suite, and add a test
with 5 context lines at the end of the file (not enough to trigger back_search()
but more than -U trailing context), and... so many tests needed for this to be
load bearing.

My REAL test is probably doing a "git log" of my fullhist kernel tree back to
0.0.1 and having two parallel directories one version apart so this diff can
compare each commit with the next commit. I don't expect the resulting diffs to
be identical to what git produces, but my "patch" should be able to apply the
resulting diff so that "git reset $COMMIT" followed by "git diff" says there are
no differences. And my patches shouldn't be significantly LONGER than git's
patches in line count, although I dunno about that last part...

Rob

P.S. Seriously, proper context line detection is painful. I had a whole "find
longest span of consecutive matches" approach I ripped out to do this one
instead, because there was ALWAYS a pathological input I could construct that
made it give the wrong answer. This one can do that too but at least it's CHEAP,
assuming qsort() isn't pathological on the hunk...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff.c
Type: text/x-csrc
Size: 9173 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20220808/fc9d35e5/attachment.c>


More information about the Toybox mailing list