[Rd] download.file does not process gz files correctly (truncates them?)

Mon May 7 02:28:31 CEST 2018

Thanks for the comments, feedback, and improvements.

I still argue that the current behavior causes more harm than good.

First of all, it increases the risk of code that does not work on all
platforms, even though cross-platform support is, I'd say, one of the
strengths and design goals of R.  To write cross-platform code, a
developer basically needs to specify argument 'mode'.
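
To illustrate, here is a minimal sketch of a wrapper that always passes
mode = "wb".  The name download_binary() is made up for this example; it
is not part of R.

```r
# Hypothetical helper (not part of R): always download in binary mode so
# that the result does not depend on the platform-specific default.
download_binary <- function(url, destfile, ...) {
  utils::download.file(url, destfile = destfile, mode = "wb", ...)
}

# Example usage (requires network access):
# url <- "https://www.r-project.org/logo/Rlogo.png"
# download_binary(url, destfile = basename(url))
```

With something like this the downloaded bytes should be the same on
Windows and elsewhere, at the cost of every caller having to know about it.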

A second problem is that people who work on non-Windows platforms will
not be aware of this problem.  Yes, adding this Windows-specific
behavior to the help on all platforms will help a bit (thanks for
doing that).  However, since there are so many non-Windows users out
there that write documentation, vignettes, blog posts, host classes
and workshops, it is quite likely that you'll see things like
"Download the data file using `download.file(url, file)` and then
...".  Boom, a "beginner" on Windows will have problems, even the
non-Windows instructor may not know what's going on, and quickly lots
of time is wasted.

A third problem is wasted bandwidth, because a file that was corrupted
by a text-mode download has to be downloaded a second time.  If the
default is changed to mode="wb" and
someone truly needs mode="w", the penalty should be smaller because
such text-based files are likely to be much smaller than binary files,
which are often several GiB these days.

What could lower the risk of the above, and help the user and helpers,
is to give an informative warning whenever 'mode' is not specified,
e.g.

   The file 'NNN' is downloaded as a text file (mode = "w"). If you
   meant to download it as a binary file, specify mode = "wb".
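
A rough sketch of how such a warning could look from the outside, written
as a wrapper rather than as a change to download.file() itself.  The
function name download_file_warn() is hypothetical, and for simplicity
the sketch warns whenever 'mode' is missing, without looking at the
filename extension:

```r
# Hypothetical wrapper (not part of R): warn whenever 'mode' is not given.
download_file_warn <- function(url, destfile, mode, ...) {
  if (missing(mode)) {
    warning(sprintf(
      paste0("The file '%s' is downloaded as a text file (mode = \"w\"). ",
             "If you meant to download it as a binary file, ",
             "specify mode = \"wb\"."),
      destfile), call. = FALSE)
    # Fall through to the existing default behavior of download.file()
    utils::download.file(url, destfile = destfile, ...)
  } else {
    utils::download.file(url, destfile = destfile, mode = mode, ...)
  }
}
```

Calling it with an explicit mode = "wb" (or mode = "w") is silent; only
the call that relies on the default triggers the warning.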

Deprecating the default mode="w" on Windows can be done in steps, e.g.
by making the argument mandatory for a while. This could be done on
all platforms because we're already all affected, i.e. we need to
specify 'mode' to avoid surprises.

Even if the default won't change, below are some more
comments/observations that are related to the current implementation of
download.file() on Windows:

ADD MORE EXTENSIONS?

What about case-insensitive matching, e.g. data.ZIP and data.Rdata?

A quick scan of the R source code suggests that R is also working with
the following filename extensions (using various case styles):

* Rbin (src/library/tools/R/install.R)
* rda, Rda (tests/reg-tests-1a.R)
* rdb (src/library/tools/R/install.R)
* rds, RDS, Rds (src/library/tools/R/install.R)
* rdx (src/library/tools/R/install.R)
* RData, Rdata, rdata (src/library/tools/R/install.R)

Should the tar extension also be added?

What about binary image formats that R produces, e.g. filename
extensions bmp, jpg, jpeg, pdf, png, tif, tiff?

What about all the other file extensions that we know for sure are binary?
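
To make the questions above concrete, here is a sketch of a
case-insensitive check that also decodes the URL first (as Joris
proposed).  The helper looks_binary() and the extension list are only
illustrations, not what download.file() uses, and example.org is a
placeholder URL:

```r
# Illustrative extension list; NOT the list used by download.file().
binary_exts <- c("gz", "bz2", "xz", "tgz", "zip", "tar",
                 "rda", "rds", "rdata", "rdb", "rdx", "rbin",
                 "bmp", "jpg", "jpeg", "pdf", "png", "tif", "tiff")

# Hypothetical helper: TRUE for each URL whose decoded form ends in one
# of the extensions, ignoring case.
looks_binary <- function(urls, exts = binary_exts) {
  pattern <- paste0("\\.(", paste(exts, collapse = "|"), ")$")
  # Decode one URL at a time so this also works if URLdecode() is not
  # vectorized in the R version at hand.
  decoded <- vapply(urls, utils::URLdecode, character(1), USE.NAMES = FALSE)
  grepl(pattern, decoded, ignore.case = TRUE)
}

looks_binary(c("data.ZIP", "data.Rdata",
               "https://example.org/file%2Egz", "notes.txt"))
# [1]  TRUE  TRUE  TRUE FALSE
```

Any such list will always be incomplete, which is part of why I think
the default itself is the real problem.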

VECTORIZATION:

For some values of the 'method' argument, the current implementation
will download the same file differently depending on other files
downloaded at the same time.  For example, here a PNG file is
downloaded in text mode and its content is translated:

> urls <- c("https://www.r-project.org/logo/Rlogo.png")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
Content length 48148 bytes (47 KB)
downloaded 47 KB
> file.size(basename(urls))
[1] 48281

But if we throw in a "known" binary extension, the PNG file will be
downloaded as binary:

> urls <- c("https://www.r-project.org/logo/Rlogo.png", "https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip")
> download.file(urls, destfile = basename(urls), method = "libcurl")
trying URL 'https://www.r-project.org/logo/Rlogo.png'
trying URL 'https://cran.r-project.org/bin/windows/contrib/3.6/future_1.8.1.zip'
> file.size(basename(urls))
[1]  48148 527069
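
My guess at why this happens, assuming the check has the form quoted
further down in this thread (length(grep(pattern, url))): with a vector
of URLs, that expression is non-zero as soon as any one URL matches, so
a single mode is chosen for all of them.  The variable names below are
just for illustration:

```r
# Pattern as quoted in this thread; the actual R source may differ.
pattern <- "\\.(gz|bz2|xz|tgz|zip|rda|RData)$"

png_url <- "https://www.r-project.org/logo/Rlogo.png"
zip_url <- paste0("https://cran.r-project.org/bin/windows/contrib/3.6/",
                  "future_1.8.1.zip")

# One decision for the whole vector: TRUE as soon as any URL matches
any_match_png  <- length(grep(pattern, png_url)) > 0
any_match_both <- length(grep(pattern, c(png_url, zip_url))) > 0

# A per-URL decision would instead treat each file on its own
per_url <- grepl(pattern, c(png_url, zip_url))

any_match_png   # FALSE: the PNG alone is treated as text
any_match_both  # TRUE: the PNG rides along with the zip file as binary
per_url         # FALSE TRUE
```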

Best,

Henrik

On Fri, May 4, 2018 at 1:18 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
>>>>>> Joris Meys <jorismeys at gmail.com>
>>>>>>     on Fri, 4 May 2018 10:00:07 +0200 writes:
>
>     > On Fri, May 4, 2018 at 8:34 AM, Tomas Kalibera
>     > <tomas.kalibera at gmail.com> wrote:
>
>     >> The current heuristic/hack is in line with the
>     >> compatibility approach: it detects files that are
>     >> obviously binary, so it changes the default behavior only
>     >> for cases when it would obviously cause damage.
>     >>
>     >> Tomas
>
>
>     > Well, I was trying to download a .gz file and
>     > download.file() didn't detect that. Reason for that is
>     > obviously that the link doesn't contain .gz but %2Egz ,
>     > using the ASCII code for the dot instead of the dot
>     > itself. That's general practice in a lot of links.
>
>     > Hence I propose to change the line in download.file() that
>     > does this check to:
>
>     >   if (missing(mode) && length(grep("\\.(gz|bz2|xz|tgz|zip|rda|RData)$",
>     >       URLdecode(url))))
>
>     > using URLdecode() ensures that .gz, .RData etc will be
>     > detected correctly in an encoded URL.
>
>     > Cheers Joris
>
> Makes sense to me and I plan to add it when also adding '.rds'
>
> { OTOH, after reading the thread about this: Shouldn't you make
>   your code more robust and use   mode = "wb" (or "ab") in any case?
>   ;-)
> }
>
> Martin
>