[Toybox] toybox git: Seeking advise to refactor some functionality into lib.c

Sat Apr 9 11:46:40 PDT 2022

On 4/9/22 11:36, Moritz Christian Weber wrote:
> Hi,
> the previous mention git (clone) toy is making progress, even as I struggled a
> lot with my bad C knowledge.

I look forward to this. Learning is part of the process. :)

> I have a 300 LOC multi-command scaffold, which
> works in most parts.

I'm curious about your design. I'd been thinking of reusing the existing
plumbing in main.c for this.

If git.c is a single C file then I'd probably split toy_find() into two
functions, with a new toy_find_list() function for toy_find to call, taking the
list to search and its length as arguments. (The "adjust to skip the first entry
of toy_list because the toybox multiplexer is special cased" should stay in the
toy_find() wrapper that operates on the global list.)

Then git.c could have a function that declares its own toy_list:

  struct toy_list *find, git_list[] = {
    // Note: keep this alphebetized for binary search
    {"help", git_help_main, "<1>1agimw", 0},
    {"pull", git_pull_main, "(recurse-submodules): qv", 0},
    ...
  };

  if (!(find = toy_find_list(*toys.optargs, git_list, ARRAY_LEN(git_list)))
    error_exit("no %s", name);
  toy_init(find, toys.optargs+1);

Hmmm, I really need to work out how to handle --option and --no-option in the
argument stuff (git pull is full of them). That's an extra prefix punctuation in
the optstr I expect (haven't used "+" yet), and a round of lib/args.c
modification to do cleanly. Or maybe not PREFIX, maybe it's a new trailing
punctuation on a longopt saying this could be a --no- variant and a second
bitfield tracking the "no"s we've seen? Given that a command line can
theoretically have --thingy --no-thingy --thingy it would have to be aware and
toggle somehow... Still need the second bitfield so the caller can use it to
switch off DEFAULT values...

  toys.optflags |= DEFAULTS;
  toys.optflags &= ~toys.nopeflags;

Anyway, I look forward to seeing what you've done. Not meaning to overwhelm you,
a lot of cans of worms can stay closed for a while longer. (Wait for somebody to
complain...)

Fred Brooks had a lovely line in The Mythical Man-Month about how you're
probably going to throw away your first implementation, but if you PLAN on doing
this you will definitely throw TWO implementations away. The insight is that you
don't properly understand a problem until after you've solved it and tried to
use the solution to do real stuff. And then the x11 guys wrote about the "second
system effect", where after you've done a working prototype implementation
that's taken off and developed a userbase, there's a strong temptation to start
over and DO IT RIGHT with a new one that will conclusively solve every problem
for everybody... meaning you hugely overreach and do a lot of "infrastructure in
search of a user" and make a bloated mess.

And then if you're REALLY lucky you get a third system like Unix, in which
survivors feeling from the collapse of a second system (in Unix' case Multics,
which was a second system response to MIT's CTSS and had 36,000 pages of design
documents before the hardware arrived; the project collapsed under its own
weight basically immediately) started over being MINIMAL this time. Having
learned all the second system lessons about what NOT to do, they were asking
what can I now AVOID doing to make a small and simple system that does only what
it needs to?

(This whole plot arc is one of the many talk ideas I never got around to
properly organizing and giving at a conference...)

> It supports also git pack indexing as it make things easier
> to handle, safes the lifetime of SDCards on embedded systems (to avoid a
> decompression of all pack objects and then removing all objects, which are not
> touched by checkout) and opens the option to extend the toy for further commands.
> 
> Anyways, I am a little stuck now in the progress due to 3 mandatory library
> refactorings for git fetch:
> 
> 1) zlib inflate() is my biggest pain point.

I'm currently poking at that to add gzip and deflate support to wget.

> The git pack format indicates the
> length of each zlib compressed object in the pack file by the size of
> UNCOMPRESSED chars after decompression.

Most likely it calculates it before compressing the data. (If I recall, the
header and body were compressed as a single block?)

> Consequently, git fetch expects to read
> a zlib compressed object as long as it did not read more than the announced
> number of uncompressed char and then expects feedback from inflate up to which
> compressed char/position/offset it read (file pointer/position/offset). After
> that it reads the next object type, uncompressed object length and compressed
> objects data and so on. I guess it is some optimization for sequential write of
> the pack format, when compressing.
> 
> I am not sure, if I can get this behavior out of inflate() in deflate.c by
> initializing deflate and bitbuf in a special way. Or if I need another inflate
> function signature. Any kind of advise would be helpful.

I'm not sure I understand. Do you need a function that something like:

  int deflate_read(char *buf, unsigned length, unsigned &used);

Which sets used to the number of bytes read from the input...

Except the problem there is the input isn't specified (filehandle, memory
buffer), and "used" is going to be a number of BITS not bytes. and the next read
may not start on a byte boundary if the output was stopped suddenly (it needs
the deflate state to resume writing into the next buffer).

I'd have thought that "here's this filehandle to read uncompressed data from"
could be passed off to the next thing that needs to read more uncompressed data
from after the previous data? (I vaguely recall deflate marks end-of-data so it
can tell provide EOF to the filehandle. When you resume from there, do you need
the next compressed or next uncompressed byte after the previous compressed
block? I guess uncompressed is more flexible and then we can feed it through the
decompressor again ourselves if we need to.)

The problem is, fundamentally what you've got it in two pieces: "here's X bytes
left in the buffer" plus "now read more from the original input". I kinda of
want to know what the consumer looks like before trying to smooth that over...

> 2) sha1sum(): I need a sha1sum() to hash decompressed git objects for indexing.
> md5sum.c seems to have the functionality, but again with focus to file
> descriptors. Any indication how to reuse sha1sum in another toy?

I suppose I can move the hash plumbing into lib? Although possibly it belongs in
portability.c because of that library config stuff to use assembly optimized
version du jour. And moving it raises the question of whether sha256sum and sha3
should be in the library too... Sigh.

A command reaching out sideways and using another command's plumbing is... not
encouraged. Although toysh does it for some shell builtins. But toysh either has
to be ok with "this command is switched off in menuconfig" or force the config
with dependencies. (Note that toysh is special cased in scripts/single.sh.) I
suppose git could have "depends" lines in its config. Or "selects"...

Anyway, generally speaking the three "clean" options are:

A) Run it as a child process, ala xpopen("command") or similar. There's xrun()
and toys/*/mount.c has a tortoise() function that should probably be genericized
somehow. (It's on the todo list.

B) Bundle multiple commands into a single .c file, although that can go
overboard (with ps.c being the prime example of "probably too much").

C) Move the plumbing in question to lib/ like I did with deflate.c a while back.
(And am revisiting for wget.c now...)

> 3) wget_main(): I currently use execv()  both required git fetch http requests
> wget.c and read the files afterwards. It works (except from the fact that file
> contains 8 char of git protocol response too), but I would be more elegant to
> have parts of wget_main() in the library for reusage. Piping would be also an
> option, but less elegant than a function. Any indication which parts and
> function signature for wget.c could/should be refactored to the library? I would
> like to http get/post and get a response buffer/char* back.

I'm working on this now. Let me promote it first, and then I can see if it needs
to go in lib...

Could you explain "8 bytes of git protocol response"? (There's a git:// protocol
and an https:// protocol and I never really read up on either...) Does this also
need to read a known number of bytes and then pass off the filehandle to a
second consumer without overshooting?

> Best regards,
> Moritz

Rob