[Toybox] Has anybody ever actually used cut -f?

Rob Landley rob at landley.net
Thu Sep 1 14:23:01 PDT 2016


On 09/01/2016 03:29 PM, Samuel Holland wrote:
> Hello,
> 
> On 09/01/2016 02:58 PM, Rob Landley wrote:
>> In theory:
>>
>> echo "one two three four five" | cut -f 2-4
>>
>> Should be really useful, and mean you don't need awk. In practice,
>> posix specifies that the default separator of cut -f is TAB, and that
>> the -d delimiter specifier has no way to specify 'arbitrary run of
>> whitespace'.
> 
> Yes, I use cut all the time. In fact, I have never intentionally used
> awk on my own--only when copied from somebody else's one-liners.
> Usually if there's a variable run of space I cut on the punctuation next
> to it, or failing that, pipe through `sed 's/\s\+/\t/g'`. Of course,
> this probably defeats the whole advantage of using cut over awk
> (simplicity), but it's habit at this point.

Uh-huh.
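
For reference, a minimal sketch of the problem under discussion, using
only standard cut and awk (the variable names are made up for
illustration). With the TAB default, cut finds no delimiter in a
space-separated line and passes the whole line through, while awk's
default splitting treats a run of blanks as one separator:

```shell
line='one two three four five'

# No TAB in the input, so cut finds no delimiter and passes the line through.
with_cut=$(printf '%s\n' "$line" | cut -f 2-4)

# awk's default field splitting treats any run of blanks as one separator.
with_awk=$(printf '%s\n' "$line" | awk '{ print $2, $3, $4 }')

printf 'cut: [%s]\nawk: [%s]\n' "$with_cut" "$with_awk"
```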

>> So I propose 2 changes to toybox cut:
>>
>> 1) -d "" means arbitrary run of whitespace.
>>
>> 2) It's the default.
> 
> I'm sure people besides me use `cut -f`, but I also assume they use -d.

I checked cut.test right after I sent that message, and every -f test
also supplies -d.

> So changing the default delimiter to arbitrary whitespace shouldn't be a
> problem...

Modulo the existing cut matching single characters, and this matching
_runs_ of characters.

But you've gotta throw that out a bit to support UTF8, so...

> I tried to search GitHub, but they broke global code search;

As Google Code did before them.

> Google got me to https://github.com/stephenturner/oneliners and
> https://gist.github.com/j3tm0t0/4122817 which apparently don't use -d.
> On the other hand, I see a lot of instances of -d " " which would be
> simplified by the proposed change.

Yes and no.

echo "one  two   three" | cut -d " " -f 2,3

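The answer is " two" with a space before it (field 2 is the empty string
between the first two spaces), and "three" not showing up at all because
it lands in field 6, after the empty fields between the later spaces.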

I'm not saying that behavior's more _useful_ (it isn't), I'm just saying
it's different from -d defaulting to a run of whitespace. Still, -d " "
would still do the same thing if explicitly supplied, so that's not a
behavior change.
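
A small sketch of that difference, assuming a standard cut and tr (the
variable names are made up for illustration). Squeezing the spaces with
tr first is only an approximation of "run of whitespace" splitting, not
the proposed toybox behavior itself:

```shell
# Single-space delimiter: every space starts a new (possibly empty) field.
# The fields here are: "one", "", "two", "", "", "three".
single=$(printf 'one  two   three\n' | cut -d ' ' -f 2,3)

# Squeeze each run of spaces down to one space before cutting.
squeezed=$(printf 'one  two   three\n' | tr -s ' ' | cut -d ' ' -f 2,3)

printf 'single:   [%s]\nsqueezed: [%s]\n' "$single" "$squeezed"
```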

>> As has been noted before, this makes about 90% of the uses of awk go
>> away. The downside is, if you're _not_ using toybox cut, it won't
>> work.
>>
>> Any opinions?
> 
> If you want to avoid breaking existing code, but make cut more useful,
> accept multiple characters for -d and match any of them.

Needing to supply -d run-of-whitespace every time using double quotes
(not single quotes) puts it about on par with awk in terms of awkwardness
to use (awk requires single quotes, not double quotes). And awk was there
first.

> Then at least
> you could do cut -d "$IFS" or similar if you don't know if the output is
> spaces or tabs.

Or I could have multichar delimiters be -d "abc" meaning
"armadilloabcbroccoliabcconfetti" could be split into armadillo,
broccoli, and confetti.
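
The two readings of a multi-character -d (match any one of the
characters, or match the whole string) can be sketched with awk, which
treats a multi-character -F value as a regular expression. This only
illustrates the two semantics; it says nothing about what cut does, and
the input string and variable names are made up:

```shell
input='redabcgreenabcpink'

# "Any of them": a bracket expression splits on each a, b, or c separately,
# leaving empty fields between adjacent separator characters.
any_count=$(printf '%s\n' "$input" | awk -F '[abc]' '{ print NF }')

# "Whole string": the three characters together act as one separator.
str_count=$(printf '%s\n' "$input" | awk -F 'abc' '{ print NF }')
str_fields=$(printf '%s\n' "$input" | awk -F 'abc' '{ print $1, $2, $3 }')

printf 'any-of: %s fields\nstring: %s fields (%s)\n' \
    "$any_count" "$str_count" "$str_fields"
```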

> This got me thinking, since \n is in $IFS...
> 
> $ printf "1234\n5678\n\n90\n" | cut -s -f2 -d$'\n'
> 5678

Cut is defined (by posix) as reading lines, which are delimited with
\n, and presumably that happens before it looks for other delimiters
_within_ the line. So what you're trying to do is facially nuts and I'd
want to see real code depending on it before looking further.

Presumably it's doing the same "read a large block of text, match, and
then back up and find line boundaries" trick their grep does?
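
A sketch of what the line-reading model implies, assuming a cut that
follows the posix rule that lines containing no delimiter are passed
through unless -s is given. Here ':' is a hypothetical stand-in for a
delimiter that can never occur inside a line, which is the position \n
is in once the input has already been split into lines:

```shell
# Same four lines as in the quoted example; none of them contains ':'.
feed() { printf '1234\n5678\n\n90\n'; }

# Without -s, lines with no delimiter are passed through unchanged.
passthrough=$(feed | cut -d ':' -f 2)

# With -s, lines with no delimiter are suppressed, so nothing is printed.
suppressed=$(feed | cut -s -d ':' -f 2)

printf 'without -s:\n%s\nwith -s: [%s]\n' "$passthrough" "$suppressed"
```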

> $ printf "1234\n5678\n\n90\n" | cut -f2 -d$'\n'
> 5678
> $ printf "1234\n5678\n\n90\n" | ./toybox cut -s -f2 -d$'\n'
> $ printf "1234\n5678\n\n90\n" | ./toybox cut -f2 -d$'\n'
> 1234
> 5678

I never promised bug-for-bug compatibility. That one is not "reading
lines". Mine is.

> 90
> $
> 
> I'm not sure what to make of that.

You asked for something crazy, and they did something crazy. Whether the
two match up is a matter of opinion.

>> Rob
> 
> Samuel

Rob


