[BioC] GEOquery Error : Retrieved files corrupted?

axel.klenk at actelion.com axel.klenk at actelion.com
Wed Feb 8 09:56:53 CET 2012


Dear Ying and Sean,

a wild guess, based on a problem description that sounds all too familiar:
binary files are likely to be corrupted if they are transferred in FTP
text mode instead of binary mode from Linux/UNIX to Windows.
Hmmm, but then getGEOSuppFiles() would never have worked on Windows...
Maybe something has changed recently in GEOquery or in the underlying
file-transfer code?
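
To make the text-versus-binary point concrete: R's download.file() has a
mode argument, and its documentation says that binary files must be
fetched with mode = "wb", a problem that only shows up on Windows. Below is
a minimal sketch, assuming (not confirmed) that this is what goes wrong
here. fetch_binary is a made-up helper name, not a GEOquery function.

```r
# Hypothetical helper (not part of GEOquery): fetch one file in binary mode.
fetch_binary <- function(url, destfile) {
  # mode = "wb" writes the bytes unchanged. On Windows the default "w"
  # is a text-mode transfer, which can alter binary content such as
  # .tar or .gz archives.
  status <- utils::download.file(url, destfile = destfile, mode = "wb")
  if (status != 0) stop("download failed for ", url)
  invisible(destfile)
}
```

A corrupted file that is somewhat larger than the original, as reported for
GSE10046_RAW.tar, would at least be consistent with extra carriage returns
having been inserted during a text-mode transfer.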

Cheers, 

 - axel


Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / 
Switzerland




From:
ying chen <ying_chen at live.com>
To:
<sdavis2 at mail.nih.gov>
Cc:
bioconductor at r-project.org
Date:
07.02.2012 20:18
Subject:
Re: [BioC] GEOquery Error : Retrieved files corrupted?
Sent by:
bioconductor-bounces at r-project.org




Hi Sean,

Thanks a lot for the help. I switched to Ubuntu on VirtualBox and now have
no problem with raw data retrieved through GEOquery. I just repeated the
test on Windows 7 with R 2.14, though, and the problem is still there. At
least I can stick with Ubuntu for now.

Thanks,
Ying

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GEOquery_2.20.8 Biobase_2.14.0

loaded via a namespace (and not attached):
[1] RCurl_1.9-5 XML_3.9-4
> 
> Date: Tue, 7 Feb 2012 14:00:10 -0500
> Subject: Re: [BioC] GEOquery Error : Retrieved files corrupted?
> From: sdavis2 at mail.nih.gov
> To: ying_chen at live.com
> CC: bioconductor at r-project.org
> 
> On Tue, Feb 7, 2012 at 11:42 AM, ying chen <ying_chen at live.com> wrote:
> >
> > Hi,
> >
> > I tried to retrieve a GEO dataset with the GEOquery package as follows:
> >
> >  file <- getGEOSuppFiles('GSE10046')
> >
> > But it seems that every raw data file I get by this method is
> > corrupted. For example, when I tried to extract GSE10046_RAW.tar, I
> > got the following error message:
> >
> >   Can not open file "H:\...\GSE10046_RAW.tar" as archive.
> >
> > The GSE10046_RAW.tar I got through GEOquery is 27,433 KB. The same
> > dataset retrieved from the GEO website is 27,350 KB, and I can extract
> > it with no problem. I have retrieved more than 70 dataset raw files with
> > GEOquery and all have the same problem. Does anyone have any suggestion
> > as to what went wrong?
> >
> > Thanks a lot for the help!
> > Ying
>
> 
> Hi, Ying.
> 
> I am not able to reproduce your error on either Mac or two flavors of
> linux.  I don't have access to a Windows version of R, but I'll see if
> I can get access in the next few days to check.
> 
> Sorry I can't be more helpful right now.
> Sean
> 
> 
> 
> > sessionInfo()
> > R version 2.14.0 (2011-10-31)
> > Platform: x86_64-pc-mingw32/x64 (64-bit)
> >
> > locale:
> > [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United States.1252
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > other attached packages:
> > [1] GEOquery_2.20.8 Biobase_2.14.0
> >
> > loaded via a namespace (and not attached):
> > [1] RCurl_1.9-5.1 XML_3.9-4.1
> >>
> >> From: ying_chen at live.com
> >> To: sdavis2 at mail.nih.gov
> >> Date: Mon, 6 Feb 2012 11:35:07 -0500
> >> CC: bioconductor at r-project.org
> >> Subject: Re: [BioC] GEOquery Error
> >>
> >>
> >> Hi Sean,
> >>
> >> Thanks a lot for the help. I checked my computer and I still have
> >> 253 GB of space left on my hard drive. I tried to retrieve the data
> >> over the weekend, but always had the same problem. I just ran it again
> >> as a test on 10 GSE IDs. At first it gave some error messages but
> >> finished the first dataset. Then the program complained about failing
> >> to open the destfile, which seems odd to me, as this is the file the
> >> program is supposed to download. It now seems that I can download
> >> datasets one by one using getGEOSuppFiles, but it always fails when I
> >> use sapply with getGEOSuppFiles to download a list of datasets. Any
> >> suggestions?
> >>
> >> Thanks a lot for the help!
> >> Ying
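
A side note on the sapply() failures, offered as a guess only: elsewhere in
this thread gseids prints with "120 Levels", so it looks like a factor.
sapply() over a factor hands each element to the function as a one-element
factor, and dir.create() rejects a non-character path with
"invalid 'path' argument". If the directory is never created, the later
"No such file or directory" messages would follow. A minimal sketch of a
workaround, in which fetch_all and fetch_one are made-up names:

```r
# Hypothetical wrapper: download supplementary files for many GEO series.
# `fetch_one` defaults to GEOquery::getGEOSuppFiles but may be any function
# that takes a single accession string.
fetch_all <- function(ids, fetch_one = GEOquery::getGEOSuppFiles) {
  ids <- as.character(ids)  # convert factor levels to plain strings
  results <- lapply(ids, function(id) {
    # Keep going when one series fails; report the failure and return NULL.
    tryCatch(fetch_one(id), error = function(e) {
      message("failed for ", id, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- ids
  results
}
```

With GEOquery loaded, this would be called as files <- fetch_all(gseids[1:10]).
Failed series come back as NULL entries and can be retried later.
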
> >> > files <- sapply(gseids[1:10],getGEOSuppFiles)
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_RAW.tar'
> >> ftp data connection made, file length 605009920 bytes
> >> opened URL
> >> downloaded 577.0 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_discovery_clinical_info.txt.gz'
> >> ftp data connection made, file length 1785 bytes
> >> opened URL
> >> downloaded 1785 bytes
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_validation_clinical_info.txt.gz'
> >> ftp data connection made, file length 1681 bytes
> >> opened URL
> >> downloaded 1681 bytes
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//filelist.txt'
> >> ftp data connection made, file length 5871 bytes
> >> opened URL
> >> downloaded 5871 bytes
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE12790/"
> >> Error in download.file(file.path(url, i), destfile = file.path(storedir,  :
> >>   cannot open destfile 'H:/My_DataSets/BreastCancerDataSet/GSE12790/GSE12790_RAW.tar', reason 'No such file or directory'
> >> > sessionInfo()
> >> R version 2.14.0 (2011-10-31)
> >> Platform: x86_64-pc-mingw32/x64 (64-bit)
> >>
> >> locale:
> >> [1] LC_COLLATE=English_United States.1252
> >> [2] LC_CTYPE=English_United States.1252
> >> [3] LC_MONETARY=English_United States.1252
> >> [4] LC_NUMERIC=C
> >> [5] LC_TIME=English_United States.1252
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] GEOquery_2.20.8     Biobase_2.14.0      BiocInstaller_1.2.1
> >>
> >> loaded via a namespace (and not attached):
> >> [1] RCurl_1.9-5.1 tools_2.14.0  XML_3.9-4.1
> >> > files <- sapply(gseids[4:10],getGEOSuppFiles)
> >> Error in dir.create(GEO) : invalid 'path' argument
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
> >> Error in function (type, msg, asError = TRUE)  :
> >>   Server denied you to change to the given directory
> >> > files <- getGEOSuppFiles('GSE9195')
> >> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_RAW.tar'
> >> ftp data connection made, file length 658708480 bytes
> >> opened URL
> >> downloaded 628.2 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION.RData'
> >> ftp data connection made, file length 59288200 bytes
> >> opened URL
> >> downloaded 56.5 Mb
> >> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
> >> Error in download.file(file.path(url, i), destfile = file.path(storedir,  :
> >>   cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
> >> >
> >> > Date: Thu, 2 Feb 2012 23:46:59 -0500
> >> > Subject: Re: [BioC] GEOquery Error
> >> > From: sdavis2 at mail.nih.gov
> >> > To: ying_chen at live.com
> >> > CC: bioconductor at r-project.org
> >> >
> >> > On Thu, Feb 2, 2012 at 11:37 PM, ying chen <ying_chen at live.com> wrote:
> >> > > Hi Sean,
> >> > >
> >> > > Thanks a lot for the suggestion. I just tried a simple test
> >> > > (files <- getGEOSuppFiles("GSE23720")) and the problem is gone.
> >> > >
> >> > > But when I tried to get a lot of files at once, I got the following
> >> > > error message:
> >> > >
> >> > > > gseids
> >> > >   [1] GSE17907 GSE30010 GSE12790 GSE20711 GSE28821 GSE18864 GSE9195  GSE29431
> >> > >   [9] GSE14020 GSE7904  GSE18728 GSE15181 GSE16391 GSE12777 GSE23593 GSE22035
> >> > >  [17] GSE19383 GSE10281 GSE21217 GSE29672 GSE14986 GSE15026 GSE12763 GSE11001
> >> > >  [25] GSE14017 GSE22513 GSE7515  GSE28796 GSE26910 GSE23994 GSE19639 GSE19697
> >> > >  [33] GSE15477 GSE10270 GSE3893  GSE13787 GSE11078 GSE8977  GSE21834 GSE6885
> >> > >  [41] GSE24468 GSE20266 GSE21422 GSE3156  GSE22250 GSE18571 GSE11352 GSE7382
> >> > >  [49] GSE13806 GSE8565  GSE15619 GSE8597  GSE29832 GSE11791 GSE5102  GSE28645
> >> > >  [57] GSE32160 GSE28789 GSE18331 GSE23640 GSE23399 GSE9086  GSE22865 GSE26298
> >> > >  [65] GSE15893 GSE20086 GSE11324 GSE5116  GSE10879 GSE25407 GSE7700  GSE18912
> >> > >  [73] GSE15043 GSE27515 GSE19777 GSE21832 GSE18070 GSE11506 GSE23921 GSE23905
> >> > >  [81] GSE32158 GSE28305 GSE25162 GSE28415 GSE9015  GSE6800  GSE6548  GSE32161
> >> > >  [89] GSE24249 GSE30775 GSE26884 GSE24473 GSE20719 GSE17636 GSE18773 GSE18931
> >> > >  [97] GSE18146 GSE16070 GSE16080 GSE11683 GSE10046 GSE9747  GSE15749 GSE22664
> >> > > [105] GSE21066 GSE9586  GSE17832 GSE11330 GSE17889 GSE12199 GSE28089 GSE31448
> >> > > [113] GSE10810 GSE9196  GSE22840 GSE33658 GSE25487 GSE22544 GSE27220 GSE11581
> >> > > 120 Levels: GSE10046 GSE10270 GSE10281 GSE10810 GSE10879 GSE11001 ... GSE9747
> >> > > > files <- sapply(gseids, getGEOSuppFiles, makeDirectory = TRUE, baseDir = getwd())
> >> > > Error in dir.create(GEO) : invalid 'path' argument
> >> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE17907/"
> >> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> >> > >                                  Dload  Upload   Total   Spent    Left  Speed
> >> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
> >> > > Warning: Failed to create the file /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/GSE17907_RAW.tar: No such file or directory
> >> > >   0  328M    0  2896    0     0   3027      0 31:34:35 --:--:-- 31:34:35  3415
> >> > > curl: (23) Failed writing body (0 != 2896)
> >> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> >> > >                                  Dload  Upload   Total   Spent    Left  Speed
> >> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
> >> > > Warning: Failed to create the file /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/filelist.txt: No such file or directory
> >> > >  24  5979   24  1448    0     0   2495      0  0:00:02 --:--:--  0:00:02  3061
> >> > > curl: (23) Failed writing body (0 != 1448)
> >> > > Error in dir.create(GEO) : invalid 'path' argument
> >> > > In addition: Warning messages:
> >> > > 1: In download.file(file.path(url, i), destfile = file.path(storedir,  :
> >> > >   download had nonzero exit status
> >> > > 2: In download.file(file.path(url, i), destfile = file.path(storedir,  :
> >> > >   download had nonzero exit status
> >> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
> >> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> >> > >                                  Dload  Upload   Total   Spent    Left  Speed
> >> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
> >> > > Warning: Failed to create the file /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_RAW.tar: No such file or directory
> >> > >   0  576M    0  2896    0     0   5191      0 32:22:29 --:--:-- 32:22:29  6464
> >> > > curl: (23) Failed writing body (0 != 2896)
> >> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> >> > >                                  Dload  Upload   Total   Spent    Left  Speed
> >> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
> >> > > Warning: Failed to create the file /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_discovery_clinical_info.txt.gz: No such file or directory
> >> > >  81  1785   81  1448    0     0   3009      0 --:--:-- --:--:-- --:--:--  3506
> >> > >  81  1785   81  1448    0     0   1978      0 --:--:-- --:--:-- --:--:--  1978
> >> > > curl: (23) Failed writing body (0 != 1448)
> >> >
> >> > It is hard to tell for sure, but I think you might be out of disk
> >> > space locally.  When you get the error, check to see if you have
> >> > space left on the device to which you are saving.  GEOquery should
> >> > work fine in a loop like this.
> >> >
> >> > Sean
> >> >
> >> >
> >> > > After I killed this job and tried:
> >> > >
> >> > >> file <- getGEOSuppFiles("GSE17907")
> >> > >
> >> > > I had no problem at all.
> >> > >
> >> > > I really do not know what's wrong with the sapply() setting.
> >> > >
> >> > > Any suggestion?
> >> > >
> >> > > Thanks a lot for the help!
> >> > >
> >> > > Ying
> >> > >
> >> > >> Date: Thu, 2 Feb 2012 12:48:56 -0500
> >> > >> Subject: Re: [BioC] GEOquery Error
> >> > >> From: sdavis2 at mail.nih.gov
> >> > >> To: ying_chen at live.com
> >> > >> CC: bioconductor at r-project.org
> >> > >
> >> > >>
> >> > >> On Thu, Feb 2, 2012 at 12:38 PM, ying chen <ying_chen at live.com> wrote:
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > Hi,
> >> > >> >
> >> > >> > I want to use the GEOquery package to get the raw files of a lot
> >> > >> > of GEO datasets at once
> >> > >> > (files <- sapply(mydata$GSE_ID, getGEOSuppFiles)), but I got the
> >> > >> > following error message when I did a simple test run. Any
> >> > >> > suggestions?
> >> > >> >
> >> > >>
> >> > >> Hi, Ying.
> >> > >>
> >> > >> This is not a GEOquery issue. The directory housing the data is
> >> > >> not on the FTP site. NCBI GEO periodically rebuilds stuff on the
> >> > >> site. That might be occurring now. I'd suggest emailing NCBI GEO
> >> > >> directly if you are in a hurry. Alternatively, wait an hour or two
> >> > >> to see if the problem is resolved.
> >> > >>
> >> > >> Sean
> >> > >>
> >> > >>
> >> > >> >> library(GEOquery)
> >> > >> > Loading required package: Biobase
> >> > >> > Welcome to Bioconductor
> >> > >> >  Vignettes contain introductory material. To view, type
> >> > >> >  'browseVignettes()'. To cite Bioconductor, see
> >> > >> >  'citation("Biobase")' and for packages 'citation("pkgname")'.
> >> > >> > Setting options('download.file.method.GEOquery'='curl')
> >> > >> >> files <- getGEOSuppFiles("GSE23720")
> >> > >> > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE23720/"
> >> > >> > Error in function (type, msg, asError = TRUE)  :
> >> > >> >  Server denied you to change to the given directory
> >> > >> >> sessionInfo()
> >> > >> > R version 2.14.1 (2011-12-22)
> >> > >> > Platform: x86_64-pc-linux-gnu (64-bit)
> >> > >> > locale:
> >> > >> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >> > >> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >> > >> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >> > >> >  [7] LC_PAPER=C                 LC_NAME=C
> >> > >> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >> > >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >> > >> > attached base packages:
> >> > >> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >> > >> > other attached packages:
> >> > >> > [1] GEOquery_2.20.8 Biobase_2.14.0
> >> > >> > loaded via a namespace (and not attached):
> >> > >> > [1] RCurl_1.9-5 XML_3.9-4
> >> > >> >>
> >> > >> >
> >> > >> >
> >> > >> >        [[alternative HTML version deleted]]
> >> > >> >
> >> > >> > _______________________________________________
> >> > >> > Bioconductor mailing list
> >> > >> > Bioconductor at r-project.org
> >> > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> > >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor






