[Toybox] Working on a [currently lame] downloader

Isaac Dunham ibid.ag at gmail.com
Tue Jul 14 00:25:33 PDT 2015


On Tue, Jul 14, 2015 at 12:21:28AM -0500, Rob Landley wrote:
> On 07/13/2015 01:18 AM, Isaac Dunham wrote:
> > Hello,
> > I've been working on an HTTP(S) downloader (what wget does, but currently
> > completely incompatible with everything) for toybox.
> > Currently it works to some degree, so I thought I'd mention that it's
> > in progress, ask for a general idea of what's desired, and give people
> > an idea of how completely lame it is right now and how I'm doing it.
> > 
> > I presume that the agenda for toybox is implementing some subset of wget
> > in a compatible manner; is "what busybox wget supports + SSL" a rough
> > approximation of the desired functionality?
> 
> Yup.
> 
> > I mentioned that it's HTTP(S); it fetches files over SSL without
> > implementing SSL. I cheated on networking: it calls netcat or 
> > openssl s_client -quiet -connect.
> 
> Huh, I didn't know about that. Cool.

I actually found out about that from the busybox popmaildir help.

> > It uses an approach roughly similar to xpopen_both(), except that
> > it uses socketpair() instead of pipe(); it should be possible to switch
> > to xpopen_both(), which would probably fix a few of the bugs.
> > (I'd not realized that there was an xpopen_both until just now.)
> > This strategy is probably the main part that will actually be useful.
> 
> Ok. (I note there was no patch attached to this...)
 
Yes, I figured the code as it stands is close to useless; it's more
of a proof-of-concept.
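Stripped down to a hypothetical sketch (this is not the actual ncdl code;
the openssl command line and example.com are just stand-ins), the approach
is a socketpair() plus fork/exec of the network client, with the parent
speaking HTTP over its end of the pair:

  /* Sketch of the socketpair() approach: spawn
   * "openssl s_client -quiet -connect host:port" as a child and talk to
   * it over one end of a socketpair.  Like the hack described above, this
   * just reads until the far side closes. */
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/wait.h>

  int main(void)
  {
    int sv[2];
    pid_t pid;
    char buf[4096];
    ssize_t len;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv)) return 1;
    if ((pid = fork()) < 0) return 1;
    if (!pid) {
      /* child: stdin and stdout both point at the other end of the pair */
      dup2(sv[1], 0);
      dup2(sv[1], 1);
      close(sv[0]);
      close(sv[1]);
      execlp("openssl", "openssl", "s_client", "-quiet",
             "-connect", "example.com:443", (char *)0);
      _exit(127);
    }
    close(sv[1]);

    /* parent: write the request, then read whatever comes back */
    dprintf(sv[0], "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n");
    while ((len = read(sv[0], buf, sizeof(buf))) > 0)
      fwrite(buf, 1, len, stdout);

    close(sv[0]);
    waitpid(pid, NULL, 0);
    return 0;
  }

Switching to xpopen_both() would amount to replacing the
socketpair/fork/exec part with that helper.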


> > Now, the lame part (ie, everything else).
> > The working name is ncdl (because it's a downloader that uses netcat,
> > of course); sample usage is
> >  ncdl -u example.com:443/index.html -s -o index.html
> > You can probably see some oddities:
> > - currently, it assumes that the underlying protocol is HTTP, and does
> >   not accept proper http:// or https:// urls
> 
> That's easy enough to add. And I'd want to add ftp:// support. (Only
> implementing passive mode is probably fine. The only ftp site I've
> encountered in the past decade that _didn't_ support that was the one
> serving the DNS root zone file updates.)
> 
> This gets into the ftpget/ftpput stuff in the todo list. (Which
> aboriginal is still using the busybox commands for; uploading results of
> native-builds out of the emulator to the host system through the virtual
> network.)

If I write a downloader with http:// and ftp:// support, I want to add
gopher:// support also ;-).
The gopher protocol, on the client side, consists of writing ("%s\r\n", file)
to the connection and reading everything that's sent back.
...yet Firefox disabled gopher:// support to reduce attack surface.
(Eyeroll)
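
Since the entire client side fits in a couple of lines, here's a rough
sketch of a complete gopher fetch (hypothetical code, not part of any
toybox command; gopher_fetch and its arguments are made up):

  #include <stdio.h>
  #include <unistd.h>
  #include <netdb.h>
  #include <sys/socket.h>

  /* Connect to the gopher port, send the selector plus CRLF, then dump
   * everything the server sends back until it closes the connection. */
  int gopher_fetch(const char *host, const char *selector)
  {
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *ai;
    char buf[4096];
    ssize_t len;
    int fd;

    if (getaddrinfo(host, "70", &hints, &ai)) return -1;
    fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
    if (fd < 0 || connect(fd, ai->ai_addr, ai->ai_addrlen)) return -1;
    freeaddrinfo(ai);

    dprintf(fd, "%s\r\n", selector);           /* the whole request */
    while ((len = read(fd, buf, sizeof(buf))) > 0)
      fwrite(buf, 1, len, stdout);             /* everything sent back */
    close(fd);
    return 0;
  }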

> > - since it doesn't parse a url scheme, it uses -s to decide whether
> >   to use SSL
> > - the URL is passed via -u, rather than as an argument
> > - -o is used to select file output, as in curl
> 
> Command line stuff's not a big deal. It's the protocol I'm worried about.
> 
> HTTP 1.1 and 2.x have a metadata block at the start, with a bunch of
> "keyword: value" lines (ended with a blank line) that controls stuff
> like optional file lengths and seek/resume and multiple files per
> connection and so on. I wrote python code to generate/parse all this
> stuff ~13 years ago, but literally haven't looked at it in more than 5
> years, and I've seen it done _wrong_. (I used to fetch video podcasts
> from msnbc.com using "wget --continue" until I figured out that resuming
> partial downloads led to corrupted video files.)

HTTP 1.0 has the same metadata block; it's just that with 1.1 you almost
always need to parse it, which my hack doesn't do.
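
For reference, actually parsing the block isn't much more work than
skipping it; a hedged sketch (parse_headers is a made-up name, and it
only looks at the status line and Content-Length):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <strings.h>

  /* Read the status line and "Name: value" lines from fp, stopping at the
   * blank line that ends the block.  Returns Content-Length, or -1 if the
   * status wasn't 200 or no length was given. */
  long parse_headers(FILE *fp)
  {
    char line[4096];
    long length = -1;

    if (!fgets(line, sizeof(line), fp)) return -1;  /* "HTTP/1.x 200 OK" */
    if (!strstr(line, " 200 ")) return -1;

    while (fgets(line, sizeof(line), fp)) {
      if (!strcmp(line, "\r\n") || !strcmp(line, "\n")) break;
      if (!strncasecmp(line, "Content-Length:", 15))
        length = strtol(line + 15, NULL, 10);
    }
    return length;
  }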

> > But the implementation is at least as lame:
> > - it doesn't check the status of the network client, just whether it
> >   could write to the socket/pipe connected to it
> > - it uses an HTTP/1.0 request, and doesn't bother checking
> 
> Yeah, that's so obsolete rather a lot of websites don't even support it
> anymore. (For one thing it only accesses the default site at a given IP
> address, I don't believe you can do virtual domains through 1.0? For
> another a number of sites have been dropping support for that over the
> past decade because the last browser that actually _issued_ such
> requests was back in the 90's. I noticed this because the "echo GET / |
> netcat" trick stopped working. :)

What I wrote amounted to a C version of that, plus skipping the metadata
block.
Technically, it was supposed to be more like:
 printf 'GET /\r\n\r\n' | netcat
with an optional ' HTTP/1.0' following the filename, and optionally some
metadata fields after that.
Virtual domain support simply means reading and writing 'Host:' fields
in the metadata; that field was not mentioned in the HTTP/1.0 standard,
so a strictly conforming implementation of 1.0 would not support it.

However, if you submit
printf 'GET / HTTP/1.0\r\nHost: www.msnbc.com\r\n\r\n' | nc www.msnbc.com 80

the server responds with an HTTP/1.0 reply, containing the actual content
(insofar as what's there can be called that).
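
On the client side, virtual domain support really does boil down to
formatting that one extra field; a trivial sketch (build_request is a
made-up helper, not anything in toybox):

  #include <stdio.h>

  /* Format an HTTP/1.0 request with a Host: field into buf. */
  int build_request(char *buf, size_t size, const char *host,
                    const char *path)
  {
    return snprintf(buf, size, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n",
                    path, host);
  }

so build_request(buf, sizeof(buf), "www.msnbc.com", "/") produces exactly
the request above.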

> >   Content-Length: it's intended to just read till there's no more data.
> >   In reality, it doesn't work that well. It consistently keeps trying to
> >   read after we run out of data and needs to be manually interrupted.
> > - The extent of header checking is "did we get HTTP/1.* 200 OK?".
> >   Then it discards everything till it gets a "blank line" ("\r\n\r\n",
> >   since like most network protocols HTTP needs both \r and \n).
> > - That means that it doesn't support redirects yet.
> 
> You sometimes have to use shutdown() to make http 1.0 work. (Toybox
> netcat gets it right, and I fixed it in busybox netcat way back when.)
> In theory xpopen_both() should be able to do that too.

I'd been guessing that, hence the comment about xpopen_both possibly
fixing some bugs.
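
For what it's worth, the shutdown() trick itself is just a half-close
after the request goes out; a sketch assuming fd is already a connected
TCP socket, which isn't quite my situation since the hack talks to a
spawned netcat:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/socket.h>

  /* Send the request, half-close the write side so the server sees EOF,
   * then read until the server closes its end too. */
  void send_and_drain(int fd, const char *request)
  {
    char buf[4096];
    ssize_t len;

    dprintf(fd, "%s", request);
    shutdown(fd, SHUT_WR);
    while ((len = read(fd, buf, sizeof(buf))) > 0)
      fwrite(buf, 1, len, stdout);
  }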

Thanks,
Isaac Dunham
