[BioC] GEOquery and GEO issues

Mon Jan 23 14:04:00 CET 2006

On 1/23/06 5:18 AM, "Christian.Stratowa at vie.boehringer-ingelheim.com"
<Christian.Stratowa at vie.boehringer-ingelheim.com> wrote:

> Dear Sean 
> 
> While trying to find a parser for the GEO soft files I encountered your
> GEOquery package which works great.
> Nevertheless, I would like to mention three issues which might be of general
> interest: 
> 
> 1, Memory problems:
> I have downloaded from GEO the file 'GSE2109_family.soft.gz' first (due to
> our proxy settings I cannot use
> getGEO for this purpose) and then imported it into R with:
> gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
> Although I have succeeded in importing the file into R, it took 39.3 hours
> on a 64 bit Opteron machine with
> 16 GB RAM and used 9.7 GB RAM. The final .Rdata file has a size of 2.0 GB.
> Maybe, a future version of GEOquery could reduce both time and memory
> consumption. 

This is obviously a problem with large GSEs.

> 2, Non-unique GEO platforms:
> I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz' where we
> had to use both the
> Affymetrix HGU95A and HGU95Av2 chips. In my personal opinion it is a serious
> flaw of the GEO 
> database that it declares both chips as single platform GPL91.
> In your description of the GEOquery package, chapter 4.3 Converting GSE to
> an exprSet, you supply
> code to make sure that all of the GSMs are from the same platform (see my
> small function below).
> Unfortunately, this is not sufficient in this case (and probably for other
> Affymetrix chips where two versions exist).
> Even though the Sample_data_row_count is different (12625 vs 12626) cbind
> simply recycles the rows.
> In this case, I could test if Sample_data_row_count is identical for all
> chips, but in theory different chip versions could still have the same
> number of probe sets.
> One possibility would be that GEO forces the submitters not only to supply
> Sample_platform_id, but
> also a "Sample_platform_title" which would contain the name of the chip as
> given by the manufacturer.

Just to clarify--I am in no way affiliated with GEO and have no control over
the way their database functions or what is stored in it.  I have simply
tried to provide a means to easily parse as much of GEO data as possible.

As for your situation, this is easily remedied:

Instead of using 'cbind' blindly (which assumes that the GPL and the data
are in the same order, which they need not be), use match first.  In fact,
that is probably the safest way to do things--I'll change the vignette.
Something like this:

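 # Probe IDs from the platform (GPL) define the row order.
 probesets <- Table(GPLList(gse)[[1]])$ID

 # Align each sample (GSM) to that order before binding columns;
 # probes missing from a sample become NA instead of being recycled.
 dat <- do.call('cbind', lapply(GSMList(gse), function(x) {
     tab <- Table(x)
     mymatch <- match(probesets, tab$ID_REF)
     tab$VALUE[mymatch]
 }))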
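
To see what the match step buys you without downloading anything, here is a
small self-contained sketch in base R. The probe IDs, sample names and values
are made up for illustration; the data frames only stand in for what
Table(x) would return for a GSM. The second sample is in a different order
and lacks one probe, roughly like the HGU95A/HGU95Av2 case:

```r
# Hypothetical platform probe IDs (stand-in for Table(gpl)$ID).
probesets <- c("p1", "p2", "p3", "p4")

# Hypothetical sample tables (stand-ins for Table(gsm)).
gsm_tables <- list(
  GSM_A = data.frame(ID_REF = c("p1", "p2", "p3", "p4"),
                     VALUE  = c(1, 2, 3, 4)),
  GSM_B = data.frame(ID_REF = c("p3", "p1", "p2"),   # reordered, "p4" absent
                     VALUE  = c(30, 10, 20))
)

# Align every sample to the platform order; unmatched probes become NA.
dat <- sapply(gsm_tables, function(tab) {
  tab$VALUE[match(probesets, tab$ID_REF)]
})
rownames(dat) <- probesets

# Count the probes each sample is missing; a nonzero count hints that the
# sample may come from a different chip version than the platform table.
missing_per_sample <- colSums(is.na(dat))
print(dat)
print(missing_per_sample)
```

Checking the NA counts per sample is one way to flag mixed chip versions
even when the row counts happen to agree.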

> 
> 3, Sample descriptions:
> Since most data are useless w/o the sample description, which contains the
> clinical data, it would
> be helpful if GEO would supply a certain format for adding the clinical
> data, so that it would be
> possible to write a parser to extract these data automatically into a table.

Again, I do not have any control over what GEO does with regard to clinical
annotation.  Where the clinical data is present, it should be possible to
write a specific function or set of functions to extract it; writing a
general function to do this is currently not possible for GSEs for the
reason that you note--there isn't a specified format.
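
As an illustration of what such a dataset-specific function might look like,
here is a sketch in base R. It assumes, purely for the sake of example, that
the submitter wrote each description as "key: value" pairs separated by
semicolons, with the same keys in the same order for every sample. GEO does
not guarantee this layout; the function name parse_characteristics and the
sample strings are hypothetical:

```r
# Split one "key: value; key: value" string into a named character vector.
# The separator and the "key: value" layout are assumptions about one
# particular dataset, not something GEO enforces.
parse_characteristics <- function(x, sep = ";") {
  fields <- trimws(strsplit(x, sep, fixed = TRUE)[[1]])
  kv     <- strsplit(fields, ":", fixed = TRUE)
  keys   <- trimws(vapply(kv, `[`, character(1), 1))
  # Re-join anything after the first colon so values may contain colons.
  vals   <- trimws(vapply(kv, function(p) paste(p[-1], collapse = ":"),
                          character(1)))
  setNames(vals, keys)
}

# Hypothetical descriptions, one per sample.
descriptions <- c(GSM1 = "age: 61; sex: M; stage: B",
                  GSM2 = "age: 55; sex: F; stage: A")

# One row per sample. rbind aligns by position, so this relies on every
# sample listing the same keys in the same order.
clinical <- as.data.frame(
  do.call(rbind, lapply(descriptions, parse_characteristics)),
  stringsAsFactors = FALSE
)
print(clinical)
```

For a real GSE the strings would come from the sample metadata, and the
splitting rules would have to be adapted to whatever the submitter used.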

I hope this clarifies things a bit.  Thanks for the constructive feedback.

Sean