[Toybox] More expand cleanups

Tue Nov 27 23:51:57 PST 2012

I don't like adding new types to global headers which are only used in  
one command, so change global tablist to be a void *... then stop and  
think. A linked list of integers is a bit inefficient, shouldn't this  
just be an array?

We have toybuf, which we can trivially fill with an array of up to 1024  
ints. Is that enough for "expand"? Would 2048 "short" (or 512 long  
long) be better? Does having a finite amount of these (which is way  
more than anybody's screen size) violate any specification anywhere?

I hardwired in a similar limitation to the number of columns in ls, but  
if somebody did have an insane display size combined with a huge  
directory it would still display all the contents, just in no more than  
a few hundred columns. In that case it was acceptable. Is expand used  
for some kind of industrial purpose having nothing to do with human  
readable (ballpark 80 column) screens?

If having a maximum tablist size _is_ a problem we can malloc a buffer  
of the right size, but right now you physically _can't_ have more than  
65k input entries on linux due to process environment limitations (32  
times 4k pages of command line text, if each one is a single character  
that's 65k), so there's already an implicit limit, we're just making it  
smaller but still "sufficiently huge".

So I'm leaning towards filling out toybuf and giving the bad tablist  
the error message if it's more than 1024 entries in the list.

So change the global from "tablist" to "tablen".

Also, in the help text I'm tempted to just say "a comma separated list"  
instead of having comma or space awkwardly explained in the  
paranethetical. Yes we should support space, but since "expand -n 1 2  
3" treats 2 and 3 as filenames, you need quotes around it to use spaces  
as a separator which is just awkward. If they're going to read the docs  
to see how to use it, comma is fine. If an existing script already  
provides spaces, we work anyway... Wordsmithing a bit.

Also, the gnu/dammit version accepts more than one -t:

echo -e "one\ttwo\tthree\tfour" | expand -t 10,18 -t 25

So it concatenates those, which means the ascii argument list should be  
t*, not t: and the TT entry is an arg_list.

So, convert argument to t*, convert global type to struct arg_list  
*tabs, add int tabcount to keep track of how many unsigned toybuf[]  
entries parsing filled out, redo build_tablist based on iterating  
through TT.tabs, sub-iterating through each string within that, and  
incrementing TT.tabcount, and while I'm at it move the -t parsing to  
the start of expand_main() because there's not much point having a  
separate function that's called from exactly one place, might as well  
just do it inline.

Now let's look at expand_file(). Collate all same declaration types, so  
just one line of ssize_t declarations instead of several... and why are  
we using ssize_t for something that it's a file position offset?

Ah, you're using toybuf already, for file input. Reasonable choice. Ok,  
I'll break the parsing logic out into a function that returns the  
number of entries and takes the array to write into as its argument,  
and have it not write anything to the array if the argument is NULL.  
Then I can call it twice, once to count the number of entries and the  
second time to fill out the array.

Back to expand_file(). The downside of using readall() is that  
interactive granularity goes way down. I had this problem with "tee"  
once upon a time, it meant that piping the output of anything through  
tee made it appear in 4k chunks, which meant if you logged the result  
of a build you couldn't really see what the build was doing. I'm not  
sure expand has the same use cases, but that's why I did xread().

Also, I had an argument with dalias (of musl) a while back, who  
insisted that read() no longer returns 0 length reads when interrupted  
by sigstop. So you can check for return 0 without also checking for  
-EINTR now, I think? Anyway, I'll leave readall() for now.

STDOUT_FILENO is 1. Has been since 1969. Just use 1. Although stdout is  
a FILE * which means it's got built-in output buffering, so if we use  
printf() and friends we don't have to worry about output buffering, and  
in fact we can use the printf("%*c", len, char) thing to do the padding  
for us with a single codepath... Except that won't work for a trailing  
tab at the end of file with no newline after it. Ok, cheat:
   printf("%*c", len-1, ' ');

Ah, hang on. Internationalization. This thing is going to need  
multibyte support for utf8, isn't it? (The same general logic as wc -m.  
Hmmm, I wonder if they can share code?)

Ok, I'll have to come back to this in the morning.

Rob
 1354089117.0