[Toybox] More than you really wanted to know about patch.

Sun Jan 13 21:30:12 PST 2019

On 1/13/19 3:57 PM, scsijon wrote:
> Any chance of a two or three page "Introduction to Creating and Understanding
> Patches for Dummies" for those of us who either don't know how to build one, or
> like me, have, "but don't really know what i'm doing".
>
> When you can make time of course, i'd really like to understand more of what the
> group is doing with patches submitted rather than only a little.
> 
> Please, with pure honey on crumpets.
Patches are reasonably straightforward, if somewhat reverse engineered historically.

Back in the 1980's somebody invented diff -u ("unified diff format") as a more
human readable alternative to the <old >new lines format you get without the -u,
and then Larry Wall whipped up a program to reverse the process and use saved
diff -u output to modify a file (which was mind-blowing at the time). As far as
I can tell the format wasn't really meant for that, and was made to work with
heuristics and hitting it with a rock, but Larry _did_ go on to invent Perl...

A patch is a series of "hunks", describing a range of lines in the "old" version
and the corresponding range in the "new" version. Patches have 6 different types
of lines, each starting with one of "+++ ", "--- ", "@@ ", " ", "+", or "-".

The first 2 (the --- and +++ lines) are control lines that come at the start and
indicate we're working on a new file. They indicate the old file name and the
new file name for the changed files. If you "diff -u oldfile newfile" you get a
hunk starting with:

  --- oldfile
  +++ newfile
  @@ -oldstart,oldlines +newstart,newlines @@ comment
  and so on

Those first two lines are --- or +++, one space, and the filename.

Unfortunately, the original unified diff format then followed each filename with
a tab character and the timestamp of the file (in "yyyy-mm-dd hh:mm:ss tzoff"
format), which means if you have a tab character in the filename you can't patch
it. These days this datestamp is optional, and most patches don't have one
anymore. (I have a todo item to make toybox patch work backwards from the end of
the line and peel off only a properly formatted tab+date entry and leave it
alone otherwise, but right now it just stops at the first tab. Which is not a
space or newline, and thus almost never occurs in filenames and nobody's
complained yet (because if you tab in the windows gui it switches focus so
windows people can't trivially create this breakage and then wine for us to
"support" it)...  Still, lemme do a quick commit to make that suck _slightly_
less by at least requiring the next character to be a digit in order to match
the date and strip it off. It still doesn't handle filenames with a newline in
them, but... how would you?)
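
A rough sketch of that "tab followed by a digit" rule, in Python rather than
toybox's actual C. The function name is made up for illustration, and a
filename containing a tab followed by a digit would still be cut short:

```python
def split_header_name(line):
    """Return the filename from a '--- ' or '+++ ' header line.

    Cuts at the first tab only when the character after it is a digit,
    i.e. when it looks like the start of a yyyy-mm-dd date.
    """
    if not (line.startswith("--- ") or line.startswith("+++ ")):
        raise ValueError("not a file header line")
    name = line[4:].rstrip("\n")
    tab = name.find("\t")
    # Slicing (not indexing) so a trailing tab doesn't raise IndexError.
    if tab != -1 and name[tab + 1:tab + 2].isdigit():
        name = name[:tab]
    return name
```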

If this (now optional) date was the unix epoch (midnight, January 1, 1970, which
timezone adjustments often moved to December 31, 1969), it indicated we were
comparing against a nonexistent file. The more modern way to say this is to use
the special filename /dev/null. So if you want patch to create a file, what you
do is "diff -u /dev/null newfile", and if you want it to delete a file, "diff -u
oldfile /dev/null". (Otherwise it leaves a zero length file when you remove
all the lines, or expects an empty file to already be there when adding with no
context lines.)
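
To see what the /dev/null form looks like without having diff handy, here is a
small sketch using Python's standard difflib module, which emits the same
unified format. The file names and contents are just placeholders:

```python
import difflib

new_lines = ["hello\n", "world\n"]

# Creating a file: the old side is /dev/null and has no lines.
create = list(difflib.unified_diff([], new_lines, "/dev/null", "newfile"))

# Deleting a file: the new side is /dev/null and has no lines.
delete = list(difflib.unified_diff(new_lines, [], "oldfile", "/dev/null"))

print("".join(create), end="")
print("".join(delete), end="")
```

The nonexistent side shows up as a 0,0 range in the @@ line (-0,0 when
creating, +0,0 when deleting).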

The other fun thing is when you diff 2 files, the files need to have different
names. How do you know which one you're applying the patch to? Historically, it
tried both names and used whichever one worked... but if you happen to have a
file with your tempname lying around in the directory you're applying the patch
_to_ (which happens a lot when you habitually use the same tempfile name), the
hunk may try to apply to the wrong file. (There were certain horrible heuristics
I don't remember that tried to work out what you _meant_ to do, which didn't
really help and I don't think I implemented them?)

And these days files have paths. As the switch from CVS to SVN (let alone git)
taught us: individual standalone files aren't very interesting, you're almost
always operating on a _tree_ of files.

So generally what you do _now_ (and what tools like svn or mercurial or git
pretend to do behind the scenes) is back up one directory, have two full trees
(the vanilla project and your modified version), and "diff -ruN" the two
subdirectories: -r is recursive, -u is unified format instead of the old < and >
version, and -N says pretend to compare new or removed files against /dev/null
so the diff says to add or remove them properly. That's why tools like svn or
mercurial or git will create diffs that start like:

  --- a/path/to/file
  +++ b/path/to/file

Except... now you've got an extra level of directory you don't want, so you have
to back up _out_ of your project's tree to apply the patch and it's STILL
guessing which name you mean.

So what you do is create the diffs like that, then use the "-p 1" option when
applying them, which says "peel off one layer of directory when parsing the
filenames". That removes the a/ and b/ from the paths, and the rest should be
identical so it's no longer ambiguous and it doesn't matter if you use the +++
or the --- line as the file to apply the patch to. (No, -p1 doesn't apply to the
magic name /dev/null, absolute paths aren't modified, only relative ones. Also,
you can say "-p0" to disable the above "certain horrible heuristics" on pathless
filenames and just literally use the filenames in the patch, but that doesn't
come up much these days. Creating a diff between two trees and applying it
within the top level of the tree via "patch -p1" is nearly universal now. That's
the format "git format-patch -1 $HASH" and "git am file.patch" are using, for
example.)
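
A minimal sketch of the -p stripping as described above, in Python. The
function name is hypothetical, and leaving absolute paths untouched follows
the description here; other patch implementations may treat them differently:

```python
def strip_components(name, count):
    """Drop `count` leading directories from a relative path.

    /dev/null and other absolute paths are returned unchanged.
    """
    if name.startswith("/"):
        return name
    parts = name.split("/")
    if count >= len(parts):
        # Assumption: never strip the final component (the file name itself).
        count = len(parts) - 1
    return "/".join(parts[count:])
```

With -p1 both a/path/to/file and b/path/to/file come out as path/to/file,
which is why it stops mattering which of the two header lines you use.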

Ok, so all that's indicating what file hunks apply to, then you get to actual
hunks describing what changes to make within the file.  Each hunk starts with an
@@ line, with 4 numbers, like so:

  @@ -start,len +start,len @@ comment

Each "start" is the (decimal) line number in that file the hunk starts applying
at, and the "len" is the (decimal) number of lines described in that file. These
numbers measure the body of the hunk, which comes next.

(The "comment" part can be anything, and doesn't even have to be there. It's
ignored. Modern language-aware diff -u variants stick which C function you're
modifying in there, which is nice for humans but not used by patch that I know
of. The simple crappy heuristic there is "last unindented line", which can find
goto labels: ...)
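
Parsing the @@ line can be sketched like this in Python (an illustration, not
toybox's code). One detail not mentioned above: diff may omit the ",len" part
when the length is 1, so the sketch defaults a missing length to 1:

```python
import re

# @@ -start[,len] +start[,len] @@ optional comment
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@(.*)$")

def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len, comment)."""
    m = HUNK_HEADER.match(line.rstrip("\n"))
    if not m:
        raise ValueError("not a hunk header: %r" % line)
    old_start, old_len, new_start, new_len, comment = m.groups()
    return (int(old_start),
            1 if old_len is None else int(old_len),
            int(new_start),
            1 if new_len is None else int(new_len),
            comment.strip())
```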

Each line of the rest of the body of that hunk starts with one of three characters:

1) + meaning this line is only in the new version (it was added).
2) - meaning this line is only in the old version (it was removed).
3) " " (space) = this line is the same in both (it's context for the changes).

The context lines plus + lines need to add up to the "len" in the + part of the
@@ line, and the context lines plus - lines need to add up to the len in the -
part. (The start is more or less a comment, used to indicate how far off it
applies at if the hunk moved but otherwise not really mattering as far as I can
tell. Well, toybox doesn't use it.)
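
The counting rule is simple enough to sketch in a few lines of Python. The
function name is invented for the example, and any line that doesn't start
with one of the three markers is simply rejected:

```python
def check_hunk_counts(old_len, new_len, body):
    """Return True when the body lines add up to the lengths from the @@ line."""
    old_seen = new_seen = 0
    for line in body:
        marker = line[:1]
        if marker == " ":      # context: counts toward both files
            old_seen += 1
            new_seen += 1
        elif marker == "-":    # removed: only in the old file
            old_seen += 1
        elif marker == "+":    # added: only in the new file
            new_seen += 1
        else:
            return False       # simplification: anything else is invalid
    return old_seen == old_len and new_seen == new_len
```

For example, 6 context lines plus one - and two + lines matches
"@@ -start,7 +start,8 @@".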

Note: if your code is tab indented, it still needs a space (ascii 32) at the
start of it to be a context line, then it's binary identical for the contents
(so tabs or spaces as appropriate). This causes some editors to flip out about
mixing tabs and spaces, but the distinction is functional here.

Patch opens files when it sees --- +++ line pairs, reads in the next @@ hunk and
the appropriate number of lines after it (with the right number of context
lines, additions, and removals for what the @@ line counts said), and then
searches in the file for a place where the appropriate context lines and removed
lines appear in the right order (removed lines are matched just like context, if
they're not there in the file the hunk doesn't apply), then replaces it with the
set of context lines and added lines the hunk says should go there instead.
(Note that if you patch -R then it's the + lines being removed and the - lines
being added, "reversing" the patch.)
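
Here is a simplified Python sketch of that search-and-replace step. It works
on lists of lines without trailing newlines, ignores the start/end of file
special cases, and is only a model of the idea, not how toybox implements it:

```python
def apply_hunk(file_lines, body, search_from=0, reverse=False):
    """Apply one hunk body to a list of lines.

    Returns (new_lines, next_search_position), or raises ValueError when
    the hunk does not apply.
    """
    # Reversing a patch swaps the roles of the + and - lines.
    remove, add = ("+", "-") if reverse else ("-", "+")
    old = [l[1:] for l in body if l[0] in (" ", remove)]
    new = [l[1:] for l in body if l[0] in (" ", add)]

    # Slide down one line at a time until context and removed lines match.
    for pos in range(search_from, len(file_lines) - len(old) + 1):
        if file_lines[pos:pos + len(old)] == old:
            patched = file_lines[:pos] + new + file_lines[pos + len(old):]
            return patched, pos + len(new)
    raise ValueError("hunk does not apply")
```

The returned position is where a following hunk would resume searching, so
later hunks can't match lines an earlier hunk already consumed.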

Each hunk generally starts with 3 leading context lines, and ends with 3 trailing
context lines, which generally provides enough context to uniquely identify
where to apply the hunk even if you're just adding a single line (that's the
pathological case of providing no other corroborating information). The
exception is when your hunk applies at the start or end of the file: then
there aren't enough context lines, and may not be _any_ if you're right at the
end or beginning of the file.

The hunk also has interstitial context lines as appropriate (between the
additions and removals, which also have to match or the hunk won't apply), but
not more than 6 (leading + trailing context line count) or it'd split into 2
hunks. (This _does_ mean you can have 4 context lines in a row though.)
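
The 6 line limit is easy to poke at with Python's difflib, which as far as I
can tell groups hunks the same way with its default 3 lines of context. The
20 line file below is made up for the demonstration:

```python
import difflib

def hunk_count(old, new):
    """Count the @@ lines in a unified diff with the default 3 context lines."""
    diff = difflib.unified_diff(old, new, "a/file", "b/file", lineterm="")
    return sum(1 for line in diff if line.startswith("@@"))

old = ["line %d" % n for n in range(1, 21)]

near = list(old)
near[4] = "changed 5"     # line 5
near[11] = "changed 12"   # line 12: 6 unchanged lines in between

far = list(old)
far[4] = "changed 5"      # line 5
far[12] = "changed 13"    # line 13: 7 unchanged lines in between

print(hunk_count(old, near))  # 1
print(hunk_count(old, far))   # 2
```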

What IS important is that you have the same number of leading context lines as
trailing context lines, unless you're at the start/end of a file. If the counts
differ anywhere else, it's not a valid hunk and patch barfs on the corrupted patch. And the number of
leading/trailing context lines not being the same means the patch program will
try to MATCH the start/end of the file (whichever one's got truncated context),
and fail if it can't (hunk does not apply, context is wrong).
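
A small Python sketch of that check, assuming the hunk body has already been
read in. The function name and return values are invented for illustration,
and it only looks at which side has less context:

```python
def context_anchor(body):
    """Compare leading and trailing context counts of a hunk body.

    Returns None when they match (the hunk can apply anywhere), "start"
    when the leading context is shorter (must match the start of the
    file), or "end" when the trailing context is shorter.
    """
    leading = 0
    for line in body:
        if not line.startswith(" "):
            break
        leading += 1
    trailing = 0
    for line in reversed(body):
        if not line.startswith(" "):
            break
        trailing += 1
    if leading == trailing:
        return None
    return "start" if leading < trailing else "end"
```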

You can have as many hunks as you want within a file, i.e. as many @@ lines after
a given --- +++ pair, but the hunks must apply in order, and this INCLUDES the
context lines. A line that's been "seen" as a trailing context line won't match
against the leading context of the next hunk.

Because of this, you sometimes need 3 or more interstitial context lines in a
row in the _middle_ of a hunk (between + and - lines), if that's how your
changes work out. A number of consecutive context lines matching the leading
context does NOT end the hunk, only consuming the line counts from the @@ line
does that. And then you figure out if leading/trailing context counts match
(indicating the need to match start/end of file) _after_ that. (If you really
want to back up and modify an earlier part of the file, you need a new --- +++
pair to flush and reopen the file, so it can start over searching at the beginning.)

Oh, I know I said the start numbers in the @@ line were only used for warnings,
but you CAN use them to sanity check the leading context number if you want to.
(Since if you're forcing the hunk to match at the beginning of the file, it had
better start at line 1 in that file, or 0 if the file is empty, or something is
wrong.) Doesn't help with end of file though.

So you wind up with:

--- filename
+++ filename
@@ -start,len +start,len @@
 context
 context
 context
-blah
+blah
 context
 context
 context
@@ -start,len +start,len @@
...

Oh, the - lines usually come before the + lines when they replace the same line,
but I don't think that's actually required? The entire context is matched before
applying the hunk anyway. And note that you don't skip what you've already
looked at when a hunk didn't apply, you go down ONE line and try matching again.
If your context lines are all blank and you skip past everything you already
compared, you can jump right over the spot where the hunk would have applied.
I hit and fixed that bug years ago in toybox. :)
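
To make the "go down ONE line" point concrete, here is a Python model of the
two search strategies. This is a reconstruction of the kind of bug described,
not the actual toybox code, and the sample file is invented:

```python
def find_hunk(file_lines, old, skip_compared=False):
    """Return the index where `old` matches in file_lines, or -1.

    skip_compared=True models the buggy search that jumps past the lines
    it already compared instead of moving down a single line.
    """
    pos = 0
    while pos + len(old) <= len(file_lines):
        matched = 0
        while matched < len(old) and file_lines[pos + matched] == old[matched]:
            matched += 1
        if matched == len(old):
            return pos
        pos += max(matched, 1) if skip_compared else 1
    return -1

# Four blank lines, then code. The hunk wants three blank context lines
# followed by the line it removes.
file_lines = ["", "", "", "", "int x;", "return x;"]
old = ["", "", "", "int x;"]

print(find_hunk(file_lines, old))                      # 1
print(find_hunk(file_lines, old, skip_compared=True))  # -1
```

The first attempt at index 0 matches three blank lines before failing, and
skipping those three lines jumps over index 1, where the hunk really applies.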

And of course all this is before git added a "rename" syntax that looks like:

  https://lwn.net/Articles/244448/

And has copy and delete variants that allow it to be much less verbose (avoids
including the body of the matched file(s)).

It's on the todo list... :)

Rob

P.S. You asked.

> ps and i'm looking forward to the next mkroot, I miss Aborigonal!

Alas, I just landed back in Milwaukee to do another round of $DAYJOB because
neither toybox nor mkroot pay the bills. (I'm very grateful to the
https://patreon.com/landley subscribers, and it's great encouragement, but my
mortgage alone is like 25 times what that brings in. Nobody with a significant
budget wants to fund this work, and keeping the lights on gets scheduled higher
than things that don't. But I can presumably cut a mkroot release with the 4.20
kernel right after I do a toybox release at the end of the month. All 4.20 broke
that I've noticed so far was adding sha256 as a hard requirement to the s390x
build, and I can add that to the toybox airlock install passthroughs for the
moment...)

(I had a huge todo list for my month off... and wound up going limp for most of
it. I was doing ok until the battery in this old laptop completely died (as in
unplug = instant off, so suspend is useless and I lose all open windows every
time I move it). And alas I did NOT get Devuan working on the new System76 laptop
I ordered a few months back (binary wifi firmware tantrum in the installer), and
what they preinstalled on it has systemd, and given a choice between "system
with no battery" and "system with systemd" it's no contest. But I did get the
new She-Ra and Hilda watched, and the first season of The Good Place, so that's
something...)

Still Rob