[BioC] GEOquery Error : Retrieved files corrupted?

Sean Davis sdavis2 at mail.nih.gov
Tue Feb 7 20:00:10 CET 2012


On Tue, Feb 7, 2012 at 11:42 AM, ying chen <ying_chen at live.com> wrote:
>
> Hi, I tried to retrieve a GEO dataset with the GEOquery package as follows:
>  file <- getGEOSuppFiles('GSE10046')
> But it seems that every raw data file I got by this method is corrupted. For
> example, when I tried to extract GSE10046_RAW.tar, I got the following
> error message:
>
>   Can not open file "H:\...\GSE10046_RAW.tar" as archive.
>
> The GSE10046_RAW.tar I got through GEOquery is 27,433 KB. The same dataset
> retrieved from the GEO website is 27,350 KB, and I can extract it with no
> problem. I have retrieved more than 70 dataset raw files with GEOquery and
> all have the same problem. Does anyone have any suggestion as to what went
> wrong?
>
> Thanks a lot for the help!
>
> Ying
>

Hi, Ying.

I am not able to reproduce your error on either Mac or two flavors of
linux.  I don't have access to a Windows version of R, but I'll see if
I can get access in the next few days to check.
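
One thing that might be worth ruling out in the meantime: on Windows,
base R's download.file() defaults to text mode (mode = "w"), which
rewrites line endings and can both inflate and corrupt a binary file
such as a .tar. That would be consistent with the GEOquery copy being
slightly larger than the one from the website, though it is unconfirmed
here. A minimal sketch, assuming the archive sits at the same FTP
location pattern printed in the quoted output below; fetch_binary is a
hypothetical helper, not a GEOquery function:

```r
# Hypothetical helper (not part of GEOquery): fetch one file with base R's
# download.file(), forcing binary mode so the bytes are written unchanged.
fetch_binary <- function(url, destfile) {
  status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  invisible(destfile)
}

# Example call. The URL is an assumption built from the pattern that
# getGEOSuppFiles() prints in this thread; adjust it if the layout differs.
# fetch_binary(
#   "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE10046/GSE10046_RAW.tar",
#   "GSE10046_RAW.tar"
# )
```

If a file fetched this way extracts cleanly, the transfer mode is the
likely culprit.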

Sorry I can't be more helpful right now.
Sean
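
A possible explanation for the "invalid 'path' argument" errors in the
quoted sessions below: the printout of gseids ends in "120 Levels: ...",
which is how R prints a factor, and dir.create() only accepts a
character string. If that is what happened, the per-dataset directory
would never be created, which would be consistent with the later "No
such file or directory" failures. A minimal sketch, assuming the IDs
are held in a factor; fetch_all is a hypothetical helper, not part of
GEOquery:

```r
# Toy stand-in for the poster's IDs. The quoted printout ends in
# "120 Levels: ...", which is how R prints a factor.
gseids <- factor(c("GSE17907", "GSE30010", "GSE12790"))

# dir.create() insists on a single character string, so a factor element
# is rejected before anything is created.
res <- try(dir.create(gseids[1]), silent = TRUE)
inherits(res, "try-error")

# Hypothetical helper (not part of GEOquery): coerce the IDs to character
# and keep going if one ID fails, returning the error object in its place.
fetch_all <- function(ids, fetch) {
  ids <- as.character(ids)
  out <- lapply(ids, function(id) tryCatch(fetch(id), error = function(e) e))
  setNames(out, ids)
}

# Intended use (needs GEOquery and network access):
# files <- fetch_all(gseids, getGEOSuppFiles)
```

Catching errors per ID also means a transient FTP failure on one
dataset does not abort the rest of the batch.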



> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GEOquery_2.20.8 Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.9-5.1 XML_3.9-4.1
>>
>> From: ying_chen at live.com
>> To: sdavis2 at mail.nih.gov
>> Date: Mon, 6 Feb 2012 11:35:07 -0500
>> CC: bioconductor at r-project.org
>> Subject: Re: [BioC] GEOquery Error
>>
>>
>> Hi Sean,
>>
>> Thanks a lot for the help. I checked my computer and I still have 253 GB of
>> space left on my hard drive. I tried to retrieve the data over the weekend,
>> but always had the same problem. I just ran it again as a test on 10 GSE IDs.
>> At first it gave some error message, but it finished the first dataset. Then
>> the program complained about failing to open the destfile, which seems odd to
>> me, as this is the file the program is supposed to download. It now seems
>> that I can download datasets one by one using getGEOSuppFiles, but it always
>> fails when I use sapply with getGEOSuppFiles to download a list of datasets.
>> Any suggestions?
>>
>> Thanks a lot for the help!
>>
>> Ying
>>
>>  > files <- sapply(gseids[1:10],getGEOSuppFiles)
>> Error in dir.create(GEO) : invalid 'path' argument
>> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_RAW.tar'
>> ftp data connection made, file length 605009920 bytes
>> opened URL
>> downloaded 577.0 Mb
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_discovery_clinical_info.txt.gz'
>> ftp data connection made, file length 1785 bytes
>> opened URL
>> downloaded 1785 bytes
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//GSE30010_validation_clinical_info.txt.gz'
>> ftp data connection made, file length 1681 bytes
>> opened URL
>> downloaded 1681 bytes
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010//filelist.txt'
>> ftp data connection made, file length 5871 bytes
>> opened URL
>> downloaded 5871 bytes
>> Error in dir.create(GEO) : invalid 'path' argument
>> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE12790/"
>> Error in download.file(file.path(url, i), destfile = file.path(storedir,  :
>>   cannot open destfile 'H:/My_DataSets/BreastCancerDataSet/GSE12790/GSE12790_RAW.tar', reason 'No such file or directory'
>> > sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252
>> [2] LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] GEOquery_2.20.8     Biobase_2.14.0      BiocInstaller_1.2.1
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_1.9-5.1 tools_2.14.0  XML_3.9-4.1
>>  > files <- sapply(gseids[4:10],getGEOSuppFiles)
>> Error in dir.create(GEO) : invalid 'path' argument
>> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
>> Error in function (type, msg, asError = TRUE)  :
>>   Server denied you to change to the given directory
>> > files <- getGEOSuppFiles('GSE9195')
>> [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195/"
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_RAW.tar'
>> ftp data connection made, file length 658708480 bytes
>> opened URL
>> downloaded 628.2 Mb
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION.RData'
>> ftp data connection made, file length 59288200 bytes
>> opened URL
>> downloaded 56.5 Mb
>> trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
>> Error in download.file(file.path(url, i), destfile = file.path(storedir,  :
>>   cannot open URL 'ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE9195//GSE9195_TAMVALIDATION_README.txt'
>> >
>> > Date: Thu, 2 Feb 2012 23:46:59 -0500
>> > Subject: Re: [BioC] GEOquery Error
>> > From: sdavis2 at mail.nih.gov
>> > To: ying_chen at live.com
>> > CC: bioconductor at r-project.org
>> >
>> > On Thu, Feb 2, 2012 at 11:37 PM, ying chen <ying_chen at live.com> wrote:
>> > > Hi Sean,
>> > >
>> > > Thanks a lot for the suggestion. I just tried a simple test (> files <-
>> > > getGEOSuppFiles("GSE23720")) and the problem is gone.
>> > >
>> > > But when I tried to get a lot of files at once, I got the following error
>> > > message:
>> > >
>> > >> gseids
>> > >   [1] GSE17907 GSE30010 GSE12790 GSE20711 GSE28821 GSE18864 GSE9195  GSE29431
>> > >   [9] GSE14020 GSE7904  GSE18728 GSE15181 GSE16391 GSE12777 GSE23593 GSE22035
>> > >  [17] GSE19383 GSE10281 GSE21217 GSE29672 GSE14986 GSE15026 GSE12763 GSE11001
>> > >  [25] GSE14017 GSE22513 GSE7515  GSE28796 GSE26910 GSE23994 GSE19639 GSE19697
>> > >  [33] GSE15477 GSE10270 GSE3893  GSE13787 GSE11078 GSE8977  GSE21834 GSE6885
>> > >  [41] GSE24468 GSE20266 GSE21422 GSE3156  GSE22250 GSE18571 GSE11352 GSE7382
>> > >  [49] GSE13806 GSE8565  GSE15619 GSE8597  GSE29832 GSE11791 GSE5102  GSE28645
>> > >  [57] GSE32160 GSE28789 GSE18331 GSE23640 GSE23399 GSE9086  GSE22865 GSE26298
>> > >  [65] GSE15893 GSE20086 GSE11324 GSE5116  GSE10879 GSE25407 GSE7700  GSE18912
>> > >  [73] GSE15043 GSE27515 GSE19777 GSE21832 GSE18070 GSE11506 GSE23921 GSE23905
>> > >  [81] GSE32158 GSE28305 GSE25162 GSE28415 GSE9015  GSE6800  GSE6548  GSE32161
>> > >  [89] GSE24249 GSE30775 GSE26884 GSE24473 GSE20719 GSE17636 GSE18773 GSE18931
>> > >  [97] GSE18146 GSE16070 GSE16080 GSE11683 GSE10046 GSE9747  GSE15749 GSE22664
>> > > [105] GSE21066 GSE9586  GSE17832 GSE11330 GSE17889 GSE12199 GSE28089 GSE31448
>> > > [113] GSE10810 GSE9196  GSE22840 GSE33658 GSE25487 GSE22544 GSE27220 GSE11581
>> > > 120 Levels: GSE10046 GSE10270 GSE10281 GSE10810 GSE10879 GSE11001 ... GSE9747
>> > >> files <- sapply(gseids, getGEOSuppFiles, makeDirectory = TRUE, baseDir = getwd())
>> > > Error in dir.create(GEO) : invalid 'path' argument
>> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE17907/"
>> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time
>> > > Current
>> > >                                  Dload  Upload   Total   Spent    Left
>> > > Speed
>> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
>> > > 0Warning: Failed to create the file
>> > > Warning:
>> > > /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/GSE17907_RA
>> > > Warning: W.tar: No such file or directory
>> > >   0  328M    0  2896    0     0   3027      0 31:34:35 --:--:-- 31:34:35
>> > > 3415
>> > > curl: (23) Failed writing body (0 != 2896)
>> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time
>> > > Current
>> > >                                  Dload  Upload   Total   Spent    Left
>> > > Speed
>> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
>> > > 0Warning: Failed to create the file
>> > > Warning:
>> > > /media/Passport01/My_DataSets/BreastCancerDataSet/GSE17907/filelist.tx
>> > > Warning: t: No such file or directory
>> > >  24  5979   24  1448    0     0   2495      0  0:00:02 --:--:--  0:00:02
>> > > 3061
>> > > curl: (23) Failed writing body (0 != 1448)
>> > > Error in dir.create(GEO) : invalid 'path' argument
>> > > In addition: Warning messages:
>> > > 1: In download.file(file.path(url, i), destfile = file.path(storedir,  :
>> > >   download had nonzero exit status
>> > > 2: In download.file(file.path(url, i), destfile = file.path(storedir,  :
>> > >   download had nonzero exit status
>> > > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE30010/"
>> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time
>> > > Current
>> > >                                  Dload  Upload   Total   Spent    Left
>> > > Speed
>> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
>> > > 0Warning: Failed to create the file
>> > > Warning:
>> > > /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_RA
>> > > Warning: W.tar: No such file or directory
>> > >   0  576M    0  2896    0     0   5191      0 32:22:29 --:--:-- 32:22:29
>> > > 6464
>> > > curl: (23) Failed writing body (0 != 2896)
>> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time
>> > > Current
>> > >                                  Dload  Upload   Total   Spent    Left
>> > > Speed
>> > >   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
>> > > 0Warning: Failed to create the file
>> > > Warning:
>> > > /media/Passport01/My_DataSets/BreastCancerDataSet/GSE30010/GSE30010_di
>> > > Warning: scovery_clinical_info.txt.gz: No such file or directory
>> > >  81  1785 81 1448 0     0   3009      0 --:--:-- --:--:-- --:--:--
>> > > 3506
>> > >  81  1785 81 1448 0     0   1978      0 --:--:-- --:--:-- --:--:--
>> > > 1978curl: (23) Failed writing body (0 != 1448)
>> >
>> > It is hard to tell for sure, but I think you might be out of disk
>> > space locally.  When you get the error, check to see if you have space
>> > left on the device to which you are saving.  GEOquery should work fine
>> > in a loop like this.
>> >
>> > Sean
>> >
>> >
>> > > After I killed this job and tried:
>> > >
>> > >> file <- getGEOSuppFiles("GSE17907")
>> > >
>> > > I had no problem at all.
>> > >
>> > > I really do not know what's wrong with the sapply() setting.
>> > >
>> > > Any suggestion?
>> > >
>> > > Thanks a lot for the help!
>> > >
>> > > Ying
>> > >
>> > >> Date: Thu, 2 Feb 2012 12:48:56 -0500
>> > >> Subject: Re: [BioC] GEOquery Error
>> > >> From: sdavis2 at mail.nih.gov
>> > >> To: ying_chen at live.com
>> > >> CC: bioconductor at r-project.org
>> > >
>> > >>
>> > >> On Thu, Feb 2, 2012 at 12:38 PM, ying chen <ying_chen at live.com> wrote:
>> > >> >
>> > >> >
>> > >> >
>> > >> > Hi,
>> > >> >
>> > >> > I want to use the GEOquery package to get the raw files of a lot of GEO
>> > >> > datasets at once ( > files <- sapply(mydata$GSE_ID, getGEOSuppFiles) ), but
>> > >> > I got the following error message when I did a simple test run. Any
>> > >> > suggestions?
>> > >> >
>> > >>
>> > >> Hi, Ying.
>> > >>
>> > >> This is not a GEOquery issue. The directory housing the data is not
>> > >> on the FTP site. NCBI GEO periodically rebuilds stuff on the site.
>> > >> That might be occurring now. I'd suggest emailing NCBI GEO directly
>> > >> if you are in a hurry. Alternatively, wait an hour or two to see if
>> > >> the problem is resolved.
>> > >>
>> > >> Sean
>> > >>
>> > >>
>> > >> >> library(GEOquery)
>> > >> > Loading required package: Biobase
>> > >> > Welcome to Bioconductor
>> > >> >  Vignettes contain introductory material. To view, type
>> > >> >  'browseVignettes()'. To cite Bioconductor, see
>> > >> >  'citation("Biobase")' and for packages 'citation("pkgname")'.
>> > >> > Setting options('download.file.method.GEOquery'='curl')
>> > >> >> files <- getGEOSuppFiles("GSE23720")
>> > >> > [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE23720/"
>> > >> > Error in function (type, msg, asError = TRUE)  :
>> > >> >  Server denied you to change to the given directory
>> > >> >> sessionInfo()
>> > >> > R version 2.14.1 (2011-12-22)
>> > >> > Platform: x86_64-pc-linux-gnu (64-bit)
>> > >> > locale:
>> > >> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> > >> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> > >> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> > >> >  [7] LC_PAPER=C                 LC_NAME=C
>> > >> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> > >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> > >> > attached base packages:
>> > >> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> > >> > other attached packages:
>> > >> > [1] GEOquery_2.20.8 Biobase_2.14.0
>> > >> > loaded via a namespace (and not attached):
>> > >> > [1] RCurl_1.9-5 XML_3.9-4
>> > >> >>
>> > >> >
>> > >> >
>> > >> >        [[alternative HTML version deleted]]
>> > >> >
>> > >> > _______________________________________________
>> > >> > Bioconductor mailing list
>> > >> > Bioconductor at r-project.org
>> > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > >> > Search the archives:
>> > >> > http://news.gmane.org/gmane.science.biology.informatics.conductor


