[Toybox] git.c progress

Sun Jan 1 14:13:22 PST 2023

For context: Moritz is teaching himself C by writing a git implementation for
toybox. I'm kind of impressed. :)

On 12/29/22 09:09, Moritz Christian Weber wrote:
> Hi Rob,
> I "solved" the issue, which made the git.c code not finding hashes and returned
> NULL (I used a modified/buggy bsearch() instead of the indented bsearchpos()).
> As a result it now clones Toybox 0.0.1 and 0.0.2. bsearchpos() never returns
> NULL, but a long of the position nearest to the hash. The drawback is that it
> can return wrong hashes, if it does not find the intended hash, but the good
> thing is that I can debug the wrong hashes in much better and interpretable way
> in git checkout, if the code does not crash before git checkout.
> 
> When cloning toybox 0.0.2, it shows unintended random bytes at the end of the
> recreated source files, which make the files too long. It indicated that the
> delta resolving is not correct yet and will result in wrong hashes. It is also
> confirmed by higher version than 0.0.3.
> 
> For version 0.0.3 and higher the git fetch "seems" to work "perfectly", but when
> checking out, some files are interpreted folders. I guess this comes from the
> files with unintended random bytes in the end, which result in wrong hashes,
> which bsearchpos() resolves the next similar hash. The good thing is that I know
> at this level which expected file/hash it was and how it should be resolved.
> Much better starting point to debug delta resolving than if you have to do this
> in fetch while delta resolving. The code feels also much more complete seeing
> how parts of the repository are already checked out.
> 
> I started a new email thread and attached you the current version for review.

I checked the historical versions in to have some history, and did several
simple cleanup passes on it for whitespace and such, and now I'm looking at:

struct IndexV2 {
...
  char (*sha1)[20];
...
}

static void read_index(struct IndexV2 *i)
{
  FILE *fpi;

  i = malloc(sizeof(i));
  i->sha1 = malloc(20*sizeof(char));

FYI toybox has an xmalloc() wrapper function in lib/xwrap.c that checks for NULL
return from malloc and just aborts with an error if so. (Because it should
basically never happen on a system with an MMU: it means you've run out of
VIRTUAL address space, which really isn't sanely recoverable. The virtual
address space starts out empty (it used to be a repeated copy-on-write mapping
of the zero page, and now it's more complicated but essentially the same thing).
The OS goes in and plugs physical pages into the mapping after the fact as they
get "dirtied" (I.E. written to), and that deferred allocation runs out of memory
the system will swap stuff out, and if the swap partition fills up and the
system actually runs Out Of Memory it invokes the OOM killer which sends the
kill signal to processes to free up memory. Linux people have complained about
the OOM killer for years because it kills their processes and they don't like
it, but if you run without one the system will lock up instead. They've added
buckets of heuristics to try to have ChatGPT levels of stupid pick which process
to kill in OOM situations: there's no right answer, so find it! And as a result,
modern Linux systems spend MUCH longer times swap thrashing before the OOM
killer gives up and lets the system lock up so everything stops you have to
reboot it. This is called progress, apparently. Systemd is adding an OOM killer
subsystem, because of course it is. Systemd is another project like chatgpt to
introduce subtle errors hidden by layers of complexity to prevent anyone from
seeing what specifically is wrong, except instead of descending from technology
to escape beysian spam filtering it's written by a guy who Linus explicitly and
publicly kicked out of linux-kernel, which he almost never does.)

Yes, I should do a video on that. :)

Back to the function above: the first malloc is taking the size of the pointer
instead of the size of what it points to, so you're allocating either 4 or 8
bytes instead of whatever IndexV2 adds up to. Which is probably going to
coincidentally work for a bit as you scribble over adjacent memory in the heap
somewhere, but Bad Things will happen at some point.

But it's the SECOND malloc I'm confused by, because it's sort of halfway to
function pointer syntax but I'm pretty sure that's not what you mean? Do you
want a char **sha1 there perhaps? (You want a pointer to a sized array?) Also,
sizeof(char) is 1 so you don't need to multiply by that...

What is the sha1 field for? (There's also packsha1 and indexsha1 and I dunno
which is which...)

Rob