[Toybox] [patch] add sed

Fri Aug 9 12:06:25 PDT 2013

On 08/08/2013 09:30:30 PM, Strake wrote:
> On 05/08/2013, Rob Landley <rob at landley.net> wrote:
> > I put a partial version in pending to indicate that I've got this  
> one,
> 
> Ah, sorry; I just noted that it wasn't working and started to write
> it. I'll leave it to you if you wish.

Well, I've had a stub there for a while and haven't done anything with  
it. I'm trying to balance "I can do this myself" with taking advantage  
of other people's work.

At the moment I've got "ping", "mount", and "umount" in a similarly  
pending state. I have a bit of downtime this month that I hope to use  
to catch up with, but it's not unlimited. (Gotta get a new contract to  
pay the bills...) And that's not even counting Mt. Pending directory.

I remember that sed is unfortunately subtle. Back on busybox I started  
with a presumably working version and then spent several weeks working  
out why various autoconf and automake things didn't work, and doing a  
largeish rewrite. And I also have to worry about musl's regex  
implementation possibly missing things people expect (the non-extended  
version doesn't do \| for example, that turns out to be a gnuism but  
the kernel build depends on it).

So what I could really use is a test suite. :)

(Right _now_ however, I'm finishing up a caffeine detox which has led  
to a lot of napping and playing skyrim. I try to go cold turkey on  
caffeine for a few weeks every year or so reset my tolerances, and it's  
been about 3 years since I made time to do it and I am _useless_ right  
now. I plan to start up again on monday, and tackle my todo list then.)

> >> -// Digested version of what sed commands can actually tell use to  
> do.
> >> +static wint_t xfgetwc (FILE *f) {
> >> +  wint_t x = fgetwc (f);
> >> +  if (x < 0) error_exit ("failed to read");
> >> +  return x;
> >> +}
> >
> > Some sort of wide character support, but not based on TOYBOX_I18N...
> 
> Is this to build in environments lacking the wc functions?

Partly. Partly because wc support is slower and more complex so the  
ability to switch it off is nice:

http://unixsedawk.blogspot.com/2013/07/alternative-to-grep.html

(Note that the gnu/dammit stuff has two codepaths, one for using wide  
chars and one for not using them. I prefer to do that at compile time.)

> > So xafgetswde() calls afgetswde() which calls afgetwswde() and  
> wcstoambs().
> 
> Should these be in another file?

Mostly I'd like to buy a vowel. :)

(Is that reference still current?)

I like short but descriptive names. I can't tell afgetswde and  
afgetwswde apart at a glance, and despite having read it at the time, I  
don't remember what wcstoambs does. (White character something to  
ambient something?) Not that I'm at my best at the moment...

> > And the non-wide character version is implemented as a wrapper  
> around the
> > wide character version.
> 
> This seemed simpler and easier. Could be wrong.

See "huge performance differences", above. That's pretty much the  
reason for having a non-wide-character-version in the first place.  
(Otherwise, the wide character version should be able to handle ascii  
input...)

That said, if your input is utf8 then your actual regex stuff mostly  
shouldn't care. You need to be able to stomp strings to lowercase to do  
case insensitive checks, and the [abcde] stuff has to be able to tell  
what a character is for matching, but other than that utf8 should just  
work. (No null bytes, no 7 bit ascii characters in the input confusing  
stuff. If you have invalid utf8 sequences in your regex or your input,  
you get to keep the pieces.)

*shrug* I'll look more closely at what you did when I'm more coherent.

> >> +typedef struct {
> >> +  char type;
> >> +  union {
> >> +    regex_t re;
> >> +    long n;
> >> +  };
> >> +} Address;
> >> +
> >> +typedef struct Function {
> >> +  char x;
> >> +  union {
> >> +    struct {
> >> +      char *writ;
> >> +      regex_t re;
> >> +      int flags, fd;
> >> +      long n, nMatch /* for s function, to pass to regexec */;
> >> +    };
> >> +    struct Function *fns;
> >> +  };
> >> +} Function;
> >
> > So this is a named struct _and_ a typedef?
> 
> Yes, to have pointers to own type.

struct blah {
   struct blah *next;
};

> > I haven't got any typedefs in the code. When it's a struct I say  
> struct.
> > I don't currently have a single enum in the cleaned up codebase:
> > When I need a constant, I #define it. I try fairly hard not to need
> > them.
> 
> Why?

Linux kernel coding style, chapter 5:

https://www.kernel.org/doc/Documentation/CodingStyle

> > Not extending the code that was there, just zapping it all and
> > replacing it with something completely unrelated.
> 
> Sorry, the longest-common-subsequence algorithm found this. It's
> tedious to make diffs myself...

It was mostly a comment that I didn't see any of the original code  
still there, it looked like you started over and were implementing a  
new sed without reference to the one that was there. (Deleting the help  
text was my first clue...)

> >> +static Command *cs;
> >> +
> >> +static long      parseNumber    (FILE *);
> >> +static Address   parseAddress   (FILE *);
> >> +static Function  parseFunction  (FILE *);
> >> +static Function *parseFunctions (FILE *);
> >> +static Command   parseCommand   (FILE *);
> >
> > Globals outside GLOBALS()
> 
> I get build failure if they are in GLOBALS:

If you need a pointer to a struct that's not in a header, use a void *  
and copy it to a local variable pointer of the right type to use it.

Most of those are function prototypes, though. Normally I can order my  
function declarations so as not to need to prototype anything. (When I  
can't, it's a sign my design's too complex.)

> unknown type name 'Command'
> field X declared as a function
> 
> >> +/* parse natural number */
> >> +static long parseNumber (FILE *f) {
> >> +  ...
> >> +}
> 
> > Sed reads a line of data into a buffer and operates on it. The posix
> > "getline()" can do this, and in posix 2008 can even allocate the  
> buffer
> > for you
> > ...
> > In a buffer, you can use strtoul() which tells you where the number  
> ended.
> 
> This is to parse the script, not the input.

My point was we shouldn't need a parseNumber() from FILE *, we should  
be able to read it into memory and use strtoul().

Rob
 1376075185.0