[Toybox] [patch] add sed
Rob Landley
rob at landley.net
Fri Aug 9 12:06:25 PDT 2013
On 08/08/2013 09:30:30 PM, Strake wrote:
> On 05/08/2013, Rob Landley <rob at landley.net> wrote:
> > I put a partial version in pending to indicate that I've got this
> one,
>
> Ah, sorry; I just noted that it wasn't working and started to write
> it. I'll leave it to you if you wish.
Well, I've had a stub there for a while and haven't done anything with
it. I'm trying to balance "I can do this myself" with taking advantage
of other people's work.
At the moment I've got "ping", "mount", and "umount" in a similarly
pending state. I have a bit of downtime this month that I hope to use
to catch up with, but it's not unlimited. (Gotta get a new contract to
pay the bills...) And that's not even counting Mt. Pending directory.
I remember that sed is unfortunately subtle. Back on busybox I started
with a presumably working version and then spent several weeks working
out why various autoconf and automake things didn't work, and doing a
largeish rewrite. And I also have to worry about musl's regex
implementation possibly missing things people expect (the non-extended
version doesn't do \| for example, that turns out to be a gnuism but
the kernel build depends on it).
So what I could really use is a test suite. :)
(Right _now_ however, I'm finishing up a caffeine detox which has led
to a lot of napping and playing skyrim. I try to go cold turkey on
caffeine for a few weeks every year or so reset my tolerances, and it's
been about 3 years since I made time to do it and I am _useless_ right
now. I plan to start up again on monday, and tackle my todo list then.)
> >> -// Digested version of what sed commands can actually tell use to
> do.
> >> +static wint_t xfgetwc (FILE *f) {
> >> + wint_t x = fgetwc (f);
> >> + if (x < 0) error_exit ("failed to read");
> >> + return x;
> >> +}
> >
> > Some sort of wide character support, but not based on TOYBOX_I18N...
>
> Is this to build in environments lacking the wc functions?
Partly. Partly because wc support is slower and more complex so the
ability to switch it off is nice:
http://unixsedawk.blogspot.com/2013/07/alternative-to-grep.html
(Note that the gnu/dammit stuff has two codepaths, one for using wide
chars and one for not using them. I prefer to do that at compile time.)
> > So xafgetswde() calls afgetswde() which calls afgetwswde() and
> wcstoambs().
>
> Should these be in another file?
Mostly I'd like to buy a vowel. :)
(Is that reference still current?)
I like short but descriptive names. I can't tell afgetswde and
afgetwswde apart at a glance, and despite having read it at the time, I
don't remember what wcstoambs does. (White character something to
ambient something?) Not that I'm at my best at the moment...
> > And the non-wide character version is implemented as a wrapper
> around the
> > wide character version.
>
> This seemed simpler and easier. Could be wrong.
See "huge performance differences", above. That's pretty much the
reason for having a non-wide-character-version in the first place.
(Otherwise, the wide character version should be able to handle ascii
input...)
That said, if your input is utf8 then your actual regex stuff mostly
shouldn't care. You need to be able to stomp strings to lowercase to do
case insensitive checks, and the [abcde] stuff has to be able to tell
what a character is for matching, but other than that utf8 should just
work. (No null bytes, no 7 bit ascii characters in the input confusing
stuff. If you have invalid utf8 sequences in your regex or your input,
you get to keep the pieces.)
*shrug* I'll look more closely at what you did when I'm more coherent.
> >> +typedef struct {
> >> + char type;
> >> + union {
> >> + regex_t re;
> >> + long n;
> >> + };
> >> +} Address;
> >> +
> >> +typedef struct Function {
> >> + char x;
> >> + union {
> >> + struct {
> >> + char *writ;
> >> + regex_t re;
> >> + int flags, fd;
> >> + long n, nMatch /* for s function, to pass to regexec */;
> >> + };
> >> + struct Function *fns;
> >> + };
> >> +} Function;
> >
> > So this is a named struct _and_ a typedef?
>
> Yes, to have pointers to own type.
struct blah {
struct blah *next;
};
> > I haven't got any typedefs in the code. When it's a struct I say
> struct.
> > I don't currently have a single enum in the cleaned up codebase:
> > When I need a constant, I #define it. I try fairly hard not to need
> > them.
>
> Why?
Linux kernel coding style, chapter 5:
https://www.kernel.org/doc/Documentation/CodingStyle
> > Not extending the code that was there, just zapping it all and
> > replacing it with something completely unrelated.
>
> Sorry, the longest-common-subsequence algorithm found this. It's
> tedious to make diffs myself...
It was mostly a comment that I didn't see any of the original code
still there, it looked like you started over and were implementing a
new sed without reference to the one that was there. (Deleting the help
text was my first clue...)
> >> +static Command *cs;
> >> +
> >> +static long parseNumber (FILE *);
> >> +static Address parseAddress (FILE *);
> >> +static Function parseFunction (FILE *);
> >> +static Function *parseFunctions (FILE *);
> >> +static Command parseCommand (FILE *);
> >
> > Globals outside GLOBALS()
>
> I get build failure if they are in GLOBALS:
If you need a pointer to a struct that's not in a header, use a void *
and copy it to a local variable pointer of the right type to use it.
Most of those are function prototypes, though. Normally I can order my
function declarations so as not to need to prototype anything. (When I
can't, it's a sign my design's too complex.)
> unknown type name 'Command'
> field X declared as a function
>
> >> +/* parse natural number */
> >> +static long parseNumber (FILE *f) {
> >> + ...
> >> +}
>
> > Sed reads a line of data into a buffer and operates on it. The posix
> > "getline()" can do this, and in posix 2008 can even allocate the
> buffer
> > for you
> > ...
> > In a buffer, you can use strtoul() which tells you where the number
> ended.
>
> This is to parse the script, not the input.
My point was we shouldn't need a parseNumber() from FILE *, we should
be able to read it into memory and use strtoul().
Rob
1376075185.0
More information about the Toybox
mailing list