[Rd] download.file does not process gz files correctly (truncates them?)

Joris Meys jorismeys at gmail.com
Thu May 3 11:48:37 CEST 2018


Dear all,

I've dug a bit deeper into this at the request of Tomas Kalibera, and
found the following:

- the lock on the file only appears after trying to read it with oligo, so
that is not an R problem in itself. The download problem, however, is
independent of external packages.

- using Windows' fc utility and cygwin's cmp utility I found out that every
so often the download.file() function inserts an extra byte. There's no
obvious pattern in where these bytes are added, but the file downloaded
using download.file() is larger (in this case by about 8 kB). The
file GSM907854_inR.CEL.gz is downloaded using:

setwd("E:/Temp/genexpr/Compare")
id <- "GSM907854"
# Build the URL on one logical line; a line break inside the string would end up in the URL
flink <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=", id,
                "&format=file&file=", id, "%2ECEL%2Egz")
fname <- paste0(id,"_inR.CEL.gz")
download.file(flink,
              destfile = fname)

The file GSM907854_direct.CEL.gz is downloaded manually from
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907854 (download link
at the bottom of the page).

Output of dir in CMD:

05/03/2018  11:02 AM         4,529,547 GSM907854_direct.CEL.gz
05/03/2018  11:17 AM         4,537,668 GSM907854_inR.CEL.gz

or from R :

> diff(file.size(dir())) # contains both CEL files.
[1] 8121
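
The fc/cmp comparison can also be sketched in R itself. This is only a
minimal illustration: compare_bytes() is a hypothetical helper name, and the
two temporary files stand in for the real CEL files.

```r
# Hypothetical helper: report both sizes and the first differing byte offset.
compare_bytes <- function(path_a, path_b) {
  a <- readBin(path_a, what = "raw", n = file.size(path_a))
  b <- readBin(path_b, what = "raw", n = file.size(path_b))
  n <- min(length(a), length(b))
  # Positions (1-based) where the common prefix of both files differs
  mismatch <- which(a[seq_len(n)] != b[seq_len(n)])
  list(
    size_a     = length(a),
    size_b     = length(b),
    size_diff  = length(b) - length(a),
    first_diff = if (length(mismatch)) mismatch[1] else NA_integer_
  )
}

# Demonstration on two small temporary files (made-up bytes, not real data)
f1 <- tempfile()
f2 <- tempfile()
writeBin(as.raw(c(0x1f, 0x8b, 0x0a, 0x00)), f1)
writeBin(as.raw(c(0x1f, 0x8b, 0x0d, 0x0a, 0x00)), f2)
res <- compare_bytes(f1, f2)
```

For the real files one would call something like
compare_bytes("GSM907854_direct.CEL.gz", "GSM907854_inR.CEL.gz").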

Strangely enough I get the following message from download.file() :

Content type 'application/octet-stream' length 4529547 bytes (4.3 MB)
downloaded 4.3 MB

So the reported length is exactly the same as when I download the file
directly, but the file on disk is larger. It therefore seems that
download.file() is adding bytes when saving the data to disk. This
behaviour is independent of antivirus and/or firewalls being turned on or off.
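
One hypothesis that could be tested, and it is only an assumption at this
point, is that the file is written in text mode so that a CR byte is inserted
before some or all LF bytes. download.file() has a documented mode argument,
and mode = "wb" requests binary writing. The sketch below simulates such a
translation on made-up bytes and counts CR LF pairs. The helper names are
hypothetical, and download_binary() is defined but not called here.

```r
# Count LF bytes (0x0a) in a raw vector
count_lf <- function(bytes) sum(bytes == as.raw(0x0a))

# Count CR (0x0d) bytes that are immediately followed by LF (0x0a)
count_crlf <- function(bytes) {
  n <- length(bytes)
  if (n < 2) return(0L)
  sum(bytes[-n] == as.raw(0x0d) & bytes[-1] == as.raw(0x0a))
}

# Simulate a text-mode write that turns every LF into CR LF (assumed mechanism)
simulate_text_mode <- function(bytes) {
  out <- lapply(as.integer(bytes), function(b) if (b == 10L) c(13L, 10L) else b)
  as.raw(unlist(out))
}

# Hypothetical wrapper to repeat the download with binary writing requested
download_binary <- function(url, destfile) {
  download.file(url, destfile = destfile, mode = "wb")
}

# Demonstration on made-up bytes, not on the real CEL file
original   <- as.raw(c(0x1f, 0x8b, 0x0a, 0x00, 0x0a))
translated <- simulate_text_mode(original)
extra      <- length(translated) - length(original)
```

If this hypothesis held, the number of CR LF pairs gained in the _inR file
would be expected to account for the 8121 extra bytes. I have not verified
that.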

Also keep in mind that these are NOT standard gzipped files. These files
are a specific format for Affymetrix Human Gene 1.0 ST Arrays.

If I need to run other tests, please let me know.
Kind regards

Joris

On Wed, May 2, 2018 at 9:21 PM, Joris Meys <jorismeys at gmail.com> wrote:

> Dear all,
>
> I noticed a problem when trying to download gz files from here:
> https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907811
>
> At the bottom one can download GSM907811.CEL.gz . If I download this
> manually and try
>
> oligo::read.celfiles("GSM907811.CEL.gz")
>
> everything works fine. (oligo is a Bioconductor package)
>
> However, if I download using
>
> download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
>               destfile = "GSM907811.CEL.gz")
>
> The file is downloaded, but oligo::read.celfiles() returns the following
> error:
>
> Error in checkChipTypes(filenames, verbose, "affymetrix", TRUE) :
>   End of gz file reached unexpectedly. Perhaps this file is truncated.
>
> Moreover, if I try to delete it after using download.file(), I get a
> warning that permission is denied. I can only remove it using Windows file
> explorer after I closed the R session, indicating that the connection is
> still open. Yet, showConnections() doesn't show any open connections either.
>
> Session info below. Note that I started from a completely fresh R session.
> oligo is needed due to the specific file format of these gz files. They're
> not standard tarred files.
>
> Cheers
> Joris
>
> Session Info
> -------------------------------------------------------------------------------------
>
> R version 3.5.0 (2018-04-23)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows >= 8 x64 (build 9200)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
> [9] base
>
> other attached packages:
>  [1] pd.hugene.1.0.st.v1_3.14.1 DBI_0.8                    oligo_1.44.0
>  [4] Biobase_2.39.2             oligoClasses_1.42.0        RSQLite_2.1.0
>  [7] Biostrings_2.48.0          XVector_0.19.9             IRanges_2.13.28
> [10] S4Vectors_0.17.42          BiocGenerics_0.25.3
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_0.12.16                compiler_3.5.0
>  [3] BiocInstaller_1.30.0        GenomeInfoDb_1.15.5
>  [5] bitops_1.0-6                iterators_1.0.9
>  [7] tools_3.5.0                 zlibbioc_1.25.0
>  [9] digest_0.6.15               bit_1.1-12
> [11] memoise_1.1.0               preprocessCore_1.41.0
> [13] lattice_0.20-35             ff_2.2-13
> [15] pkgconfig_2.0.1             Matrix_1.2-14
> [17] foreach_1.4.4               DelayedArray_0.5.31
> [19] yaml_2.1.18                 GenomeInfoDbData_1.1.0
> [21] affxparser_1.52.0           bit64_0.9-7
> [23] grid_3.5.0                  BiocParallel_1.13.3
> [25] blob_1.1.1                  codetools_0.2-15
> [27] matrixStats_0.53.1          GenomicRanges_1.31.23
> [29] splines_3.5.0               SummarizedExperiment_1.9.17
> [31] RCurl_1.95-4.10             affyio_1.49.2
>
>
> --
> Joris Meys
> Statistical consultant
>
> Department of Data Analysis and Mathematical Modelling
> Ghent University
> Coupure Links 653, B-9000 Gent (Belgium)
>
> -----------
> Biowiskundedagen 2017-2018
> http://www.biowiskundedagen.ugent.be/
>
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>



