[Toybox] git.c progress

Mon Jan 2 14:20:15 PST 2023

Hi,
It seem I CCed the mailing list accidentally ;-). The git toy is most
likely not the easiest toy to start learning C coming from memory managed
languages, but the 90MB of dependencies of git give an incredible
motivation to do so :-).

Thank for reviewing read_index(). I will include xmalloc() and check what I
am actually pointing to. Read_index is currently a dead and used function.
I scaffolded it to get an idea of deserialization in C. It will be more
useful once git clone serializes the index correct.

Regarding the different kind of SHA fields: 1) sha1 field is a dynamic
array storing the values of the "git index hash table". I did not invent
this. This struct shall represent the actually documented git idx format.
The git "hash table" reads the first char of the sha1 hash
(FanOutTable(FOT)) and then binary searches the bucket for the requested
hash. 2) packsha1 field is the sha1 hash of the pack file which the idx
struct/file is created for 3) idxsha1 is the sha1 hash of the overall
struct including the packsha1. The last two sha1 field can only be
calculated correct, when git clone works correct. So I dont care much about
this hashes right now. git uses them most likely to detect bit flips in
downloaded backs and calculated indexes. The git pack format V2 is
documented here:
https://github.com/git/git/blob/master/Documentation/gitformat-pack.txt#L266
(like was broken due to renamings in the git/git repo. Fixed the links over
the weekend.)

Best regards,
Moritz

P.S.: It also corrected git fetch over the weekend, so that I read the
actually master head from the remote repo and persists it into HEAD, so
that I can get rid of the static version tag hashes in the code.

On Sun, Jan 1, 2023 at 11:01 PM Rob Landley <rob at landley.net> wrote:

> For context: Moritz is teaching himself C by writing a git implementation
> for
> toybox. I'm kind of impressed. :)
>
> On 12/29/22 09:09, Moritz Christian Weber wrote:
> > Hi Rob,
> > I "solved" the issue, which made the git.c code not finding hashes and
> returned
> > NULL (I used a modified/buggy bsearch() instead of the indented
> bsearchpos()).
> > As a result it now clones Toybox 0.0.1 and 0.0.2. bsearchpos() never
> returns
> > NULL, but a long of the position nearest to the hash. The drawback is
> that it
> > can return wrong hashes, if it does not find the intended hash, but the
> good
> > thing is that I can debug the wrong hashes in much better and
> interpretable way
> > in git checkout, if the code does not crash before git checkout.
> >
> > When cloning toybox 0.0.2, it shows unintended random bytes at the end
> of the
> > recreated source files, which make the files too long. It indicated that
> the
> > delta resolving is not correct yet and will result in wrong hashes. It
> is also
> > confirmed by higher version than 0.0.3.
> >
> > For version 0.0.3 and higher the git fetch "seems" to work "perfectly",
> but when
> > checking out, some files are interpreted folders. I guess this comes
> from the
> > files with unintended random bytes in the end, which result in wrong
> hashes,
> > which bsearchpos() resolves the next similar hash. The good thing is
> that I know
> > at this level which expected file/hash it was and how it should be
> resolved.
> > Much better starting point to debug delta resolving than if you have to
> do this
> > in fetch while delta resolving. The code feels also much more complete
> seeing
> > how parts of the repository are already checked out.
> >
> > I started a new email thread and attached you the current version for
> review.
>
> I checked the historical versions in to have some history, and did several
> simple cleanup passes on it for whitespace and such, and now I'm looking
> at:
>
> struct IndexV2 {
> ...
>   char (*sha1)[20];
> ...
> }
>
> static void read_index(struct IndexV2 *i)
> {
>   FILE *fpi;
>
>   i = malloc(sizeof(i));
>   i->sha1 = malloc(20*sizeof(char));
>
> FYI toybox has an xmalloc() wrapper function in lib/xwrap.c that checks
> for NULL
> return from malloc and just aborts with an error if so. (Because it should
> basically never happen on a system with an MMU: it means you've run out of
> VIRTUAL address space, which really isn't sanely recoverable. The virtual
> address space starts out empty (it used to be a repeated copy-on-write
> mapping
> of the zero page, and now it's more complicated but essentially the same
> thing).
> The OS goes in and plugs physical pages into the mapping after the fact as
> they
> get "dirtied" (I.E. written to), and that deferred allocation runs out of
> memory
> the system will swap stuff out, and if the swap partition fills up and the
> system actually runs Out Of Memory it invokes the OOM killer which sends
> the
> kill signal to processes to free up memory. Linux people have complained
> about
> the OOM killer for years because it kills their processes and they don't
> like
> it, but if you run without one the system will lock up instead. They've
> added
> buckets of heuristics to try to have ChatGPT levels of stupid pick which
> process
> to kill in OOM situations: there's no right answer, so find it! And as a
> result,
> modern Linux systems spend MUCH longer times swap thrashing before the OOM
> killer gives up and lets the system lock up so everything stops you have to
> reboot it. This is called progress, apparently. Systemd is adding an OOM
> killer
> subsystem, because of course it is. Systemd is another project like
> chatgpt to
> introduce subtle errors hidden by layers of complexity to prevent anyone
> from
> seeing what specifically is wrong, except instead of descending from
> technology
> to escape beysian spam filtering it's written by a guy who Linus
> explicitly and
> publicly kicked out of linux-kernel, which he almost never does.)
>
> Yes, I should do a video on that. :)
>
> Back to the function above: the first malloc is taking the size of the
> pointer
> instead of the size of what it points to, so you're allocating either 4 or
> 8
> bytes instead of whatever IndexV2 adds up to. Which is probably going to
> coincidentally work for a bit as you scribble over adjacent memory in the
> heap
> somewhere, but Bad Things will happen at some point.
>
> But it's the SECOND malloc I'm confused by, because it's sort of halfway to
> function pointer syntax but I'm pretty sure that's not what you mean? Do
> you
> want a char **sha1 there perhaps? (You want a pointer to a sized array?)
> Also,
> sizeof(char) is 1 so you don't need to multiply by that...
>
> What is the sha1 field for? (There's also packsha1 and indexsha1 and I
> dunno
> which is which...)
>
> Rob
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20230102/661e85e8/attachment.htm>