[Toybox] [PATCH] Start replacing get_line() with getline().

enh enh at google.com
Mon Jul 22 16:28:24 PDT 2019


On Mon, Jul 22, 2019 at 2:44 PM Rob Landley <rob at landley.net> wrote:
>
> On 7/22/19 10:17 AM, enh via Toybox wrote:
> > thoughts?
>
> Sorry, laptop battery ran out a few days ago (got caught inside by a
> thunderstorm and couldn't get back to my charger before the battery drained
> enough the Dell bios woke up from suspend and resumed-to-nothing), and
> thunderbird doesn't remember  open reply windows the way kmail used to.
>
> > On Fri, Jul 19, 2019 at 9:55 AM enh <enh at google.com> wrote:
> >>
> >> I started this last night, but thought I'd aim to send multiple small
> >> patches rather than work through all the callers and send one big patch.
>
> This has also been on my todo list, it's just big and touches a lot of files
> some of which are dirty in my tree, so I hadn't gotten all the way through it
> yet. :)
>
> >> I've deliberately chosen the ugly name `allocated_length` because we've
> >> had historical bugs where folks think this a line length in the sense of
> >> the return value. I do wonder whether we should actually have some kind
> >> of getline() wrapper that hides the `char *`/`size_t` pair in lib/,
>
> As long as we're changing things anyway, a wrapper that simplifies the API would
> probably make sense? Just always malloc it and return the pointer. Let's see, is:
>
>   char *blah = xgetline(fp);
>
> enough info? For the /n version it is (we can strlen), for the delim version we
> want the returned length. and it's a minor optimization to _have_ the length...
>
>   char *blah = xgetline(FILE *fp, unsigned *len);
>
> Where if len is NULL it doesn't try to set it, so the "simple call is
> s = xgetline(fp, 0);
>
> And then it can return NUL for EOF. (The downside of this is it wouldn't report
> read errors, although you can always zero and check errno. In theory we could
> use int *len and return -1 but... why? Returning null for EOF on a read error is
> right 95% of the time anyway, and the function itself could error_msg() if we
> really wanted it to, but we probably don't?)
>
> On my todo list this is entangled with "make the data member of double_list a
> void * instead of a char *" because both design decisions were fallout from
> implementing patch.c back in the day. (Although the double_list thing also
> involves turning dlist_add_nomalloc into just dlist_add and then making the
> current behavior dlist_add_wrap()... possibly struct double_list should _just_
> have the prev and next pointers, and then it can be a first member of subclassed
> double lists and we can dlist_add(&list, &new->dl);
>
> Ahem. I have a whole lot of polishing work I _could_ do if I wasn't busy with
> everything else. :)
>
> >> which makes the function easier to use in most cases but does add the
> >> less common gotcha that you wouldn't be able to getline() through
> >> multiple files at once (which does happen in at least one toy).
> >>
> >> But maybe the real fix is to look harder for places where we can just
> >> use loopfiles_lines?
>
> Also worth doing, yes. Especially since I want to make the plumbing for
> loopfiles_lines() try to read larger blocks and chop them up itself.
>
> >> Speaking of which, should we actually add two more
> >> arguments to that? Specifically: switch it to getdelim() rather than
> >> getline() behind the scenes, and also add a way to have the trailing
> >> '\n' automatically removed, since that seems to be what most callers
> >> want?
>
> Yes, the xgetline() above should NUL out the \n before returning the string.
>
> There should be a getdelim version, but NUL-ing out the delimiter still makes
> sense there. (And nulling out a NUL is a NOP so shouldn't hurt anything.)
> ALTHOUGH, it's possible the delimiter could be a string, allowing us to have
> utf8 delimiters? (Which makes NUL delimiters awkward: should it treat a NULL
> pointer as meaning delimit with NUL, should it treat an empty string that way,
> should the delimiter have a length...?)
>
>   char *s = xgetdelim(FILE *fp, char *delim, int delimlen, int *len);
>
> Eh, it's not _too_ awkward. But is there ever a case where we'd want a delimiter
> with a NUL _in_ it? (Probably not?) In which case the above is xgetdelim_utf8()
> and then we'd have an:
>
>   char *s = xgetdelim(FILE *fp, char delim, int *len);
>
> Wrapper to call it. And then xgetline() could also be a wrapepr for
> xgetdelim_utf8(), and _that_ would be the function doing the fancy "read 64k at
> at a time and chop lines out of it to xstrndup(), calling xrealloc() as
> necessary when you cross a block boundary or hit a crazy long one". AND we could
> put a ceiling on line length if we wanted to, although... do we want to? Under
> what circumstances? (Query available memory? What does ulimit -d actually limit
> again? The problem is malloc() only allocating _virtual_ memory and then the OOM
> killer telling you when you've run out of physical memory, often long after swap
> thrashing...)
>
> Ahem. There's a reason this hairball is still on the todo list. :)
>
> >> Anyway, that seemed like enough questions that it was time to send this
> >> initial patch out before doing too much more...
> >> ---
> >>  lib/password.c        |  9 +++++----
> >>  toys/other/rev.c      | 26 ++++++++++++--------------
> >>  toys/posix/uudecode.c | 21 +++++++++++++--------
> >>  3 files changed, 30 insertions(+), 26 deletions(-)
>
> Applied.
>
> Longer-term the input batching is a reasonable performance win, something you
> care about in grep. And that rewrites xgetline() and xgetdelim() wrappers around
> a new function anyway, and as long as we're doing that we may as well change the
> API to a) always return a malloced string, b) optionally return the length of
> the string...

i'm not sure what you mean here. the underlying getdelim()/getline()
does this (via the buffer in the FILE*). it's the get_line() that
we're removing that _doesn't_ buffer input. switching to getline() [or
loopfiles_lines, which uses it] gets you all the way to where you want
to be from that perspective.

> Ah, wait. There ARE times we care about the delimiter being there or not: the
> last line isn't guaranteed to have it and there are subtle behavior changes in
> sed and such (which I implemented!) based on this. (It's related to the
> sed-of-a-binary thing, you don't want to add a NUL to the end of the file if
> there wasn't one...)

on the [i think valid] assumption that we only need '\n' or '\0' as
delimiters, a flag for that and a flag for whether or not to keep the
delimiter in loopfiles_lines seems like it would be more than enough
to be getting on with? better than where we are today, anyway.

apart from the fact a few locals need to be promoted to [TT.] globals,
loopfile_lines() seems strictly like an improvement for code currently
using getline() too? the attached patch switches over nl.c, for
example, and it's less code and a larger scope for the reuse of the
getline() buffer (even ignoring the pessimal use of getline() in nl.c
that allocated+deallocated a new buffer per line).

[PATCH] nl: switch from getline() to loopfiles_lines().

This was basically just to help me think out loud while discussing
loopfiles_lines().
---
 toys/posix/nl.c | 46 ++++++++++++++++++----------------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

> Grrr. <marvin the martian voice>design, design...</marvin the martian voice>
>
> Rob
>
> (This is why I didn't immediately reply to the first email.)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-nl-switch-from-getline-to-loopfiles_lines.patch
Type: text/x-patch
Size: 2317 bytes
Desc: not available
URL: <http://lists.landley.net/pipermail/toybox-landley.net/attachments/20190722/7505a4f1/attachment-0002.bin>


More information about the Toybox mailing list