[BioC] GEOquery and GEO issues

Christian.Stratowa at vie.boehringer-ingelheim.com
Mon Jan 23 11:18:15 CET 2006

Dear Sean 

While looking for a parser for GEO SOFT files I came across your GEOquery
package, which works great.
Nevertheless, I would like to mention three issues which might be of general
interest:

1, Memory problems: 
I first downloaded the file 'GSE2109_family.soft.gz' from GEO (due to our
proxy settings I cannot use getGEO for this purpose) and then imported it
into R with:
gse2109 <- getGEO(filename='GSE2109_family.soft.gz')
Although the import succeeded, it took 39.3 hours on a 64-bit Opteron
machine with 16 GB RAM and used 9.7 GB RAM. The resulting .Rdata file has a
size of 2.0 GB.
Maybe a future version of GEOquery could reduce both time and memory
consumption?
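
As a rough illustration of how a SOFT family file can be inspected without
building the whole object in memory, the sketch below streams the file in
chunks and collects only the sample accessions. It is not how GEOquery works
internally; the function name list_soft_samples and the chunk_size parameter
are made up for this example, and it assumes every sample block starts with a
line of the form "^SAMPLE = GSMxxxx".

```r
# Sketch: list sample accessions in a SOFT family file by streaming it.
# Assumption: each sample block begins with a line "^SAMPLE = GSMxxxx".
list_soft_samples <- function(path, chunk_size = 10000L) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  samples <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break  # end of file reached
    hits <- grep("^\\^SAMPLE", lines, value = TRUE)
    ids  <- sub("^\\^SAMPLE[[:space:]]*=[[:space:]]*", "", hits)
    samples <- c(samples, trimws(ids))
  }
  samples
}
```

Only one chunk of lines is held in memory at a time, so the memory needed
depends on chunk_size and not on the size of the file.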

2, Non-unique GEO platforms: 
I have also downloaded our own CLL dataset 'GSE2466_family.soft.gz', for
which we had to use both the Affymetrix HGU95A and HGU95Av2 chips. In my
personal opinion it is a serious flaw of the GEO database that it declares
both chips as a single platform, GPL91.
In your description of the GEOquery package, chapter 4.3 "Converting GSE to
an exprSet", you supply code to make sure that all of the GSMs are from the
same platform (see my small function below).
Unfortunately, this is not sufficient in this case (and probably for other
Affymetrix chips where two versions exist).
Even though the Sample_data_row_count is different (12625 vs 12626), cbind
simply recycles the rows.
In this case I could test whether Sample_data_row_count is identical for all
chips, but in theory different chip versions could still have the same
number of probe sets.
One possibility would be for GEO to require submitters to supply not only
Sample_platform_id but also a "Sample_platform_title" containing the name of
the chip as given by the manufacturer.
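
One way to guard against the recycling described above is to compare the
probe identifiers themselves, not only the platform ID or the row count. The
following is a minimal sketch under stated assumptions: check_same_probes is
a hypothetical helper, not part of GEOquery, and it expects a list of data
frames that each have an ID_REF column (with GEOquery such a list could be
obtained with lapply(GSMList(gse), Table)).

```r
# Sketch: refuse to combine samples whose probe identifiers differ.
# 'tables' is a list of per-sample data frames with an ID_REF column.
check_same_probes <- function(tables) {
  ref  <- as.character(tables[[1]]$ID_REF)
  same <- vapply(tables, function(tb) {
    identical(as.character(tb$ID_REF), ref)
  }, logical(1))
  if (!all(same)) {
    bad    <- which(!same)
    labels <- if (is.null(names(tables))) bad else names(tables)[bad]
    stop("ID_REF differs from the first sample in: ",
         paste(labels, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data standing in for two chip versions with different probe counts
a <- data.frame(ID_REF = c("p1", "p2", "p3"), VALUE = c(1, 2, 3))
b <- data.frame(ID_REF = c("p1", "p2", "p3", "p4"), VALUE = c(4, 5, 6, 7))

# cbind() on the bare vectors recycles the shorter one and only warns
m <- suppressWarnings(cbind(a$VALUE, b$VALUE))
```

Here m has four rows, and the fourth entry of the first column is the
recycled first value of a$VALUE. Calling check_same_probes(list(s1 = a,
s2 = b)) before cbind stops with an error that names s2.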

3, Sample descriptions: 
Since most data are useless without the sample description, which contains
the clinical data, it would be helpful if GEO defined a fixed format for
adding the clinical data, so that it would be possible to write a parser
that extracts these data automatically into a table.
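
To illustrate what such automatic extraction could look like, the sketch
below assumes a purely hypothetical convention in which each line of a
sample description has the form "key: value". GEO does not prescribe this
format, and the function names and the example values are invented for the
illustration.

```r
# Sketch: turn "key: value" description lines into a sample-by-key table.
# The "key: value" convention is hypothetical, not a GEO requirement.
parse_description <- function(lines) {
  lines <- lines[grepl(":", lines, fixed = TRUE)]  # skip free-text lines
  key   <- trimws(sub(":.*$", "", lines))
  value <- trimws(sub("^[^:]*:", "", lines))
  setNames(value, key)
}

descriptions_to_table <- function(desc_list) {
  parsed <- lapply(desc_list, parse_description)
  keys   <- unique(unlist(lapply(parsed, names)))
  # p[keys] yields NA for keys a sample does not provide
  rows   <- lapply(parsed, function(p) unname(p[keys]))
  out    <- do.call(rbind, rows)
  colnames(out) <- keys
  out
}

# Invented example descriptions for two samples
tab <- descriptions_to_table(list(
  GSM1 = c("age: 61", "sex: F", "free text without separator"),
  GSM2 = c("age: 58", "stage: Binet A")
))
```

The result is a character matrix with one row per sample and one column per
key; keys missing from a sample are left as NA.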

Best regards 

Attached function: 
table4GEO <- function(gse, column="VALUE", lg2=TRUE){
# (c) Christian Stratowa   created: 01/19/2006   last modified: 01/19/2006
# Get sample table of columns "column" for GEO Series GSExxxx
# gse:    GEOquery class imported from GEO GSE file GSExxxx_family.soft (or
# column: name of column to be extracted from data table
# lg2:    if TRUE, the extracted values are log2-transformed

#  load libraries
   library(GEOquery)

#  get list
   gsm <- GSMList(gse)

#  check number of platforms (must be one platform only)
   tmp <- unlist(lapply(gsm, function(x) {Meta(x)$platform}))
   if (length(unique(tmp)) != 1) {
      stop("Data must belong to one platform ID only!")
   }

#  number of samples
   size <- length(tmp)
   print(paste("Number of samples:", size))

#  check if all samples have the chosen column
   tmp <- unlist(lapply(gsm, function(x) {which(Columns(x)[,1] == column)}))
   if (length(tmp) != size) {
      stop(paste("Only <", length(tmp), "> of <", size,
                 "> samples have column ", column))
   }

#  get "column" from all chips
#  (note: cbind recycles shorter columns if the row counts differ)
   data <- do.call("cbind", lapply(gsm, function(x){Table(x)[,column]}))
   dimnames(data)[[1]] <- Table(gsm[[1]])$ID_REF

#  optionally convert to log2 scale
   if (lg2) {
      data <- log2(data)
   }

   return(data)
}


Christian Stratowa, PhD 
Boehringer Ingelheim Austria 
Dept NCE Lead Discovery - Bioinformatics 
Dr. Boehringergasse 5-11 
A-1121 Vienna, Austria 
Tel.: ++43-1-80105-2470 
Fax: ++43-1-80105-2782 
email: christian.stratowa at vie.boehringer-ingelheim.com
