[BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data

Wed Jun 27 18:40:31 CEST 2012

Dear Sean and Gustavo,

I cannot reproduce this error. See below.

On 27/06/12 16:54, Sean Davis wrote:
> On Wed, Jun 27, 2012 at 11:38 AM, Gustavo FernÃ¡ndez BayÃ³n
> <gbayon at gmail.com>wrote:
>
>> Hi again.
>>
>> I would like to add a little bit more of information on this issue. I have
>> been debugging inside the parseGSEMatrix() function in GEOquery source
>> code. The suspicious NA's appeared when execution arrived to the following
>> line:
>>
>> ## Apparently, NCBI GEO uses case-insensitive matching
>> ## between platform IDs and series ID Refs ???
>> dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]
>>
>>
>>
>> The problem here is that 'datamat' has the correct number of rows, which
>> is around 480K, BUT 'dat' doesn't. At a glance, 'datamat' comes from the
>> series matrix file while 'dat' comes from the GPL.
>>
>> If you go to the GEO page of that GPL (
>> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534),
>> you'll find it says that the GPL decryption table has exactly 485577 rows,
>> which is kind of logical, a description for each probeset. However, inside
>> the code, 'dat' has only 143889 rows.
>>
>> Replicating directly from R console:
>>
>>> gpl <- getGEO('GPL13534',destdir='../../GEO/')
>>> Meta(gpl)$data_row_count
>> [1] "485577"
>>
>>> t <- Table(gpl)
>>> dim(t)
>> [1] 143889 37
>>
>>
>>
>> I was really surprised to find this, and I do not have enough knowledge as
>> to know if it responds to an unknown constraint I happen to ignore. Is that
>> ok? Or is there any bug in the GPL processing code? Now I'm going home, but
>> I'll try to continue debugging to see what is really happening inside.
>>
>>
> This is most likely a bug in GPL parsing.  There are A LOT of edge cases
> that I have tried to deal with, some not very appropriately.  Often, the
> error is due to an extraneous quote in an unexpected location.  I'll look
> into this one.  Could you do me a favor and send along sessionInfo() just
> so I know?
>
> Thanks,
> Sean
>
>
>
>> Any help will be very much appreciated.
>>
>> Regards,
>> Gus
>>
>>
>> ---------------------------
>> Enviado con Sparrow (http://www.sparrowmailapp.com/?sig)
>>
>>
>> El miÃ©rcoles 27 de junio de 2012 a las 10:51, Gustavo FernÃ¡ndez BayÃ³n
>> escribiÃ³:
>>
>>> Hi everybody.
>>>
>>> I am experiencing quite a few problems while trying to download and
>> parse a dataset of methylation values. These are not technical problems,
>> IMHO. GEOquery works perfectly, and it really makes getting this kind of
>> data an easy task. However, I think I do not understand exactly the
>> lifecycle of GEO series data, and I would like to ask in this list for any
>> hint on this behavior, so I could try to fix it.
>>>
>>> What I first did was to download and parse the desired GSE data file,
>> with the default value of GSMMatrix parameter (TRUE). Besides, I extracted
>> the ExpressionSet and the assayData I was looking for.
>>>
>>> my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
>>> my.expr.set <- my.gse[[1]]
>>> beta.values <- exprs(my.expr.set)
>>>
>>> What really gave me a surprise at first, was to see many strange values
>> (all containing the 'NA' string) in the featureNames of the expression set.
>>>
>>>> head(featureNames(es), n=20)
>>> [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"
>>> [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
>>> [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
>>>
>>>
>>>
>>> If I select an individual GSM in the series, and download it, the
>> featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE, I
>> get a list of GSM data sets, and the results is again good. This made me
>> suspect of the intermediate, pre-parsed, matrix form. I haven't found a
>> clue about the lifecycle of this kind of data. I mean, how the matrix is
>> built. Is it a manual process? Is it automatic?
>>>
>>> If it is a manual process, then I guess I will have to contact the
>> responsible of uploading the data to see if they can fix it. But, if it is
>> not, I would like to know if this is something relating to BioC or, more
>> plausibly, to GEO.
>>>
>>> Any help would be appreciated.
>>>
>>> Regards,
>>> Gustavo
>>>
>>>
>>> ---------------------------
>>> Enviado con Sparrow (http://www.sparrowmailapp.com/?sig)
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

library(GEOquery)

my.gse <- getGEO('GSE30870', destdir=".")

featureNames(my.gse[[1]])[1:10]
# [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
# [6] "cg00000289" "cg00000292" "cg00000321" "cg00000363" "cg00000622"
all(featureNames(my.gse[[1]]) == rownames(exprs(my.gse[[1]])))
#[1] TRUE

gpl <- getGEO('GPL13534',destdir=".")
Meta(gpl)$data_row_count == nrow(Table(gpl))
# [1] TRUE

 > sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  tools     methods
[8] base

other attached packages:
[1] GEOquery_2.23.5    Biobase_2.16.0     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 XML_3.9-4

HTH,
J.