[Toybox] Working on a [currently lame] downloader

Mon Jul 13 22:21:28 PDT 2015

On 07/13/2015 01:18 AM, Isaac Dunham wrote:
> Hello,
> I've been working on an HTTP(S) downloader (what wget does, but currently
> completely incompatible with everything) for toybox.
> Currently it works to some degree, so I thought I'd mention that it's
> in progress, ask for a general idea of what's desired, and give people
> an idea of how completely lame it is right now and how I'm doing it.
> 
> I presume that the agenda for toybox is implementing some subset of wget
> in a compatible manner; is "what busybox wget supports + SSL" a rough
> approximation of the desired functionality?

Yup.

> I mentioned that it's HTTP(S); it fetches files over SSL without
> implementing SSL. I cheated on networking: it calls netcat or 
> openssl s_client -quiet -connect.

Huh, I didn't know about that. Cool.
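
As a minimal sketch of that trick (not the code from the unposted patch): pick the helper command from an SSL flag. The function name `transport_argv` is hypothetical, and "nc" as the netcat binary name is an assumption; the openssl arguments are the ones quoted above.

```python
def transport_argv(host, port, use_ssl):
    """Return the argv for a helper that relays stdin/stdout to host:port."""
    target = f"{host}:{port}"
    if use_ssl:
        # s_client does the TLS handshake; -quiet suppresses the session
        # and certificate information so only payload reaches stdout.
        return ["openssl", "s_client", "-quiet", "-connect", target]
    # Plain TCP: netcat takes host and port as separate arguments.
    # Assumption: the netcat binary is installed under the name "nc".
    return ["nc", host, str(port)]
```

The downloader then only ever talks to the helper's stdin and stdout, so the HTTP code does not need to know whether TLS is involved.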

> It uses an approach roughly similar to xpopen_both(), except that
> it uses socketpair() instead of pipe(); it should be possible to switch
> to xpopen_both(), which would probably fix a few of the bugs.
> (I'd not realized that there was an xpopen_both until just now.)
> This strategy is probably the main part that will actually be useful.

Ok. (I note there was no patch attached to this...)
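
Since there's no patch to look at, here is a rough sketch of what "socketpair() instead of pipe()" could look like: one bidirectional socket serves as both stdin and stdout of the helper. It is written in Python for brevity and is not toybox's xpopen_both(); `spawn_on_socketpair` is a hypothetical name.

```python
import socket
import subprocess

def spawn_on_socketpair(argv):
    """Start argv with one end of a socketpair as its stdin and stdout.

    Returns (process, parent_socket).
    """
    parent, child = socket.socketpair()
    try:
        proc = subprocess.Popen(argv, stdin=child.fileno(),
                                stdout=child.fileno())
    except Exception:
        parent.close()
        raise
    finally:
        # The child process has its own copy of this end. The parent must
        # drop its copy, or it never sees EOF when the child exits.
        child.close()
    return proc, parent
```

With a pipe pair the parent would hold two descriptors (one per direction); here it holds a single socket, which also means it can half-close with shutdown() instead of closing a whole descriptor.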

> Now, the lame part (i.e., everything else).
> The working name is ncdl (because it's a downloader that uses netcat,
> of course); sample usage is
>  ncdl -u example.com:443/index.html -s -o index.html
> You can probably see some oddities:
> - currently, it assumes that the underlying protocol is HTTP, and does
>   not accept proper http:// or https:// urls

That's easy enough to add. And I'd want to add ftp:// support. (Only
implementing passive mode is probably fine. The only ftp site I've
encountered in the past decade that _didn't_ support that was the one
serving the DNS root zone file updates.)

This gets into the ftpget/ftpput stuff in the todo list. (Aboriginal is
still using the busybox commands for those, to upload the results of
native builds out of the emulator to the host system through the virtual
network.)

> - it has no idea what default ports are (so you need to specify even
>   port 80 for HTTP or port 443 for HTTPS)

Again, easy enough to fix.
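
A sketch of both fixes together (scheme parsing plus default ports), assuming the standard well-known ports 80, 443 and 21. The name `parse_url` and the "bare host means http" fallback are assumptions for illustration, not anything ncdl does.

```python
from urllib.parse import urlsplit

# Well-known default ports for the schemes discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def parse_url(url):
    """Split a URL into (scheme, host, port, path), filling in defaults."""
    if "://" not in url:
        # Assumption: a bare "host/path" means plain HTTP.
        url = "http://" + url
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {scheme}")
    if not parts.hostname:
        raise ValueError("missing host")
    # Use the explicit port if one was given, else the scheme's default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return scheme, parts.hostname, port, path
```

With this, the scheme decides whether to use SSL, so a separate -s flag would no longer be needed.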

> - since it doesn't parse a url scheme, it uses -s to decide whether
>   to use SSL
> - the URL is passed via -u, rather than as an argument
> - -o is used to select file output, as in curl

Command line stuff's not a big deal. It's the protocol I'm worried about.

HTTP 1.1 and 2.x have a metadata block at the start, with a bunch of
"keyword: value" lines (ended with a blank line) that controls stuff
like optional file lengths and seek/resume and multiple files per
connection and so on. I wrote python code to generate/parse all this
stuff ~13 years ago, but literally haven't looked at it in more than 5
years, and I've seen it done _wrong_. (I used to fetch video podcasts
from msnbc.com using "wget --continue" until I figured out that resuming
partial downloads led to corrupted video files.)
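
To make the "keyword: value" block concrete, here is a small sketch that parses an HTTP/1.x response head and checks whether a resume request was actually honored. The function names are hypothetical. One plausible way a resume goes wrong is appending a full 200 response to a partial file; that is a guess about the general failure mode, not a diagnosis of the msnbc case.

```python
def parse_head(head):
    """Parse the bytes before the blank line into (status, headers)."""
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like "HTTP/1.1 206 Partial Content".
    _version, status, *_reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:
            continue
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them.
        # Simplification: a repeated header overwrites the earlier one.
        headers[name.strip().lower()] = value.strip()
    return int(status), headers

def resume_offset_ok(status, headers, offset):
    """True only if the server really resumed at `offset`."""
    if status != 206:
        # A 200 here means the server ignored Range and sent the whole file.
        return False
    # Content-Range looks like "bytes 1000-4999/5000".
    unit, _, rng = headers.get("content-range", "").partition(" ")
    start = rng.partition("-")[0]
    return unit == "bytes" and start.isdigit() and int(start) == offset
```

A client asking to resume at byte 1000 would send "Range: bytes=1000-" and append to the existing file only when resume_offset_ok() returns True; otherwise it should start the file over.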

> But the implementation is at least as lame:
> - it doesn't check the status of the network client, just whether it
>   could write to the socket/pipe connected to it
> - it uses an HTTP/1.0 request, and doesn't bother checking

Yeah, that's so obsolete that rather a lot of websites don't even support
it anymore. (For one thing, it only accesses the default site at a given IP
address; I don't believe you can do virtual domains through 1.0. For
another, a number of sites have been dropping support for it over the
past decade because the last browser that actually _issued_ such
requests was back in the 90s. I noticed this because the "echo GET / |
netcat" trick stopped working. :)
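
For comparison, a minimal HTTP/1.1 request is only two lines longer. The sketch below (hypothetical helper `build_get`) adds the Host header that selects a virtual domain, plus "Connection: close" so the server closes the connection after the response.

```python
def build_get(host, path="/", port=80):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    # Host carries the site name, which is what lets several virtual
    # domains share one IP address. Non-default ports are appended.
    # Simplification: 80 and 443 are both treated as "default" here,
    # without checking which scheme is actually in use.
    host_value = host if port in (80, 443) else f"{host}:{port}"
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_value}",
        # Ask the server to close the connection after this response.
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Note that a 1.1 server may still answer with "Transfer-Encoding: chunked", so a real client has to decode that as well; this sketch only covers the request side.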

>   Content-Length: it's intended to just read till there's no more data.
>   In reality, it doesn't work that well. It consistently keeps trying to
>   read after we run out of data and needs to be manually interrupted.
> - The extent of header checking is "did we get HTTP/1.* 200 OK?".
>   Then it discards everything till it gets a "blank line" ("\r\n\r\n",
>   since like most network protocols HTTP needs both \r and \n).
> - That means that it doesn't support redirects yet.
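
A sketch of the two pieces described in the list above: reading up to the "\r\n\r\n" separator without throwing away body bytes that arrived in the same read, and pulling the Location header out of a 3xx response so a redirect could be followed. Both function names and the 64k header limit are assumptions for illustration.

```python
import socket

def read_head(sock, limit=65536):
    """Read until the blank line; return (head, start_of_body)."""
    buf = b""
    while b"\r\n\r\n" not in buf:
        if len(buf) > limit:
            raise ValueError("header block too large")
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("EOF before end of headers")
        buf += chunk
    # Whatever follows the separator already belongs to the body.
    head, _, rest = buf.partition(b"\r\n\r\n")
    return head, rest

def redirect_target(head):
    """Return the Location value for a 3xx response, else None."""
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or not parts[1].startswith("3"):
        return None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":
            return value.strip()
    return None
```

A caller would loop: fetch, check redirect_target(), and refetch from the new URL, with a cap on the number of hops to avoid redirect loops.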

You sometimes have to use shutdown() to make HTTP 1.0 work. (Toybox
netcat gets it right, and I fixed it in busybox netcat way back when.)
In theory xpopen_both() should be able to do that too.
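
The shutdown() point is easiest to see on a bidirectional socket: close only the write side, so the peer sees EOF on its input, then keep reading until the peer closes its side. A minimal sketch, with `request_until_eof` as a hypothetical name:

```python
import socket

def request_until_eof(sock, request):
    """Send request, half-close, then read the reply until EOF."""
    sock.sendall(request)
    # Close only the write side: the peer sees EOF on its read, while we
    # can still receive whatever it sends back.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            # An empty read means the peer closed its side: no more data.
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

With a pair of pipes the equivalent is closing the descriptor that feeds the child's stdin while keeping the one attached to its stdout open.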

> And I'm sure there are many more bugs besides.

I look forward to it.

Rob
