[BioC] GEOquery - was queryGEO fails on GDS files (GEO Datasets)

Peter bioconductor-mailinglist at maubp.freeserve.co.uk
Wed Jan 11 20:29:12 CET 2006


Sean Davis wrote:
 >Peter,
 >
 >I have recently uploaded a new package to bioconductor called GEOquery.

I've had a little play - very nice work.  Cheers.  Just a few 
queries/questions for you...

I never did work out how to load the package from the source files, but 
I noticed there is now a Windows binary package on the website...

http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html

I downloaded the ZIP file and installed it on Windows XP with R 2.1.1 
and got the following warning:

package 'GEOquery' successfully unpacked and MD5 sums checked
updating HTML package descriptions
Warning message:
no package 'file15658' was found in: packageDescription(i, fields = 
"Title", lib.loc = lib)

Question One
------------
Is the above "no package" warning important?

-------------------------------------------------------------------

Question Two
------------

 > library(GEOquery)
Warning message:
package 'GEOquery' was built under R version 2.3.0

Does the version of R matter?  I assume R version 2.3.0 is the 
development version of R, as 2.2.1 is the latest official release.

-------------------------------------------------------------------

Question Three
--------------

 > gds37 <- getGEO('GDS37', destdir="c:/temp/geo")
trying URL 'ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS37.soft.gz'
ftp data connection made, file length 132384 bytes
opened URL
downloaded 129Kb

File stored at:
c:/temp/geo/GDS37.soft.gz
c:/temp/geo/GDS37.soft.gz
parsing geodata
parsing subsets
ready to return

Why does it print the file location twice?

-------------------------------------------------------------------

Question Four
-------------
If I repeat the getGEO command, why does it re-download the file?

 > gds37 <- getGEO('GDS37', destdir="c:/temp/geo")

I would personally have written the getGEO code to check in the 
destination folder for the files GDS37.soft or GDS37.soft.gz and just 
load the local copy if it existed.

I know I should use the following instead:

 > gds37 <- getGEO(filename="c:/temp/geo/gds37.soft.gz")
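
For what it's worth, the check I have in mind could look roughly like 
this.  It is only a sketch: findLocalSoft and getGEOCached are names I 
made up (they are not GEOquery functions), and it assumes the cached 
file is named after the accession exactly as in the download above.

```r
# Hypothetical helper: return the path of a cached SOFT file for a GEO
# accession, or NULL if neither GDSxx.soft nor GDSxx.soft.gz is present.
findLocalSoft <- function(geo, destdir) {
  candidates <- file.path(destdir, paste(geo, c(".soft", ".soft.gz"), sep = ""))
  existing <- candidates[file.exists(candidates)]
  if (length(existing) > 0) existing[1] else NULL
}

# Hypothetical wrapper: parse the local copy when there is one,
# otherwise fall back to the normal download into destdir.
getGEOCached <- function(geo, destdir) {
  localFile <- findLocalSoft(geo, destdir)
  if (!is.null(localFile)) {
    GEOquery::getGEO(filename = localFile)
  } else {
    GEOquery::getGEO(geo, destdir = destdir)
  }
}
```

With that, getGEOCached('GDS37', "c:/temp/geo") would only go to the 
FTP site the first time.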


-------------------------------------------------------------------

Question Five
-------------
I like how you have handled converting subset information into phenotype 
data in GDS2eSet.

Have you considered also parsing the "description" to extract the 
"Alternative Sample Name" and the "Sample Source"?

As far as I can tell, all the current NCBI GDS files use the same format 
for the description lines:

"Value for SAMPLENAME: ALTNAME; src: SOURCE"

On the other hand, this is clearly not a "defined field" and is subject 
to change.  Maybe automatically parse a line if and only if it 
follows that format?
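
To make the idea concrete, here is a rough sketch of such a conditional 
parse in plain R.  parseDescription is a name I made up, the regular 
expression is just my reading of the format quoted above, and the 
example strings are invented rather than taken from a real GDS file.

```r
# Hypothetical parser for description lines of the form
#   "Value for SAMPLENAME: ALTNAME; src: SOURCE"
# Lines that do not match the format are left as NA rather than guessed at.
parseDescription <- function(desc) {
  pattern <- "^Value for ([^:]+): (.*); src: (.*)$"
  ok <- grepl(pattern, desc)
  data.frame(
    sample  = ifelse(ok, sub(pattern, "\\1", desc), NA_character_),
    altName = ifelse(ok, sub(pattern, "\\2", desc), NA_character_),
    source  = ifelse(ok, sub(pattern, "\\3", desc), NA_character_),
    stringsAsFactors = FALSE
  )
}

# Invented example input: one line in the expected format, one that is not.
parsed <- parseDescription(c("Value for GSM1: ctrl rep 1; src: liver",
                             "some other description"))
```

The second row of parsed comes back as NA in all three columns, so a 
change in the description format would show up as missing values 
instead of silently wrong phenotype data.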

-------------------------------------------------------------------

Thanks again - GEOquery looks like it will be very handy...

Peter


