[BioC] FW: GEOquery package

Ochsner, Scott A sochsner at bcm.edu
Tue Aug 30 17:48:06 CEST 2011


Jing,

Here is where you have to be very careful.  The metadata do indicate that the values are log2 signal intensities and that RMA has been applied.  As this dataset is from Affymetrix, I would expect log2 values to fall in the range of roughly 2 to 16, and from what little you have shown us this appears to be the case.  If so, the data.matrix <- log2(data.matrix) step in your script is not needed: it takes the log a second time, which is why your output collapses to values between about 1.5 and 3.2.

The safest bet is to import the .CEL files, if available, and normalize them yourself.  I've come across a few datasets archived in GEO in which the normalization procedure described in the journal article is not consistent with the one described in the GEO metadata, which in turn is not consistent with the actual data.  With GEO data I have truly found it to be a case of buyer beware.
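
As a rough illustration of the range check mentioned above, here is a minimal sketch in base R.  The helper name looks_log2 and the cutoff of 20 are my own illustrative choices, not part of GEOquery or any other package, and the test values are simply the five numbers posted for the first sample.  Treat it as a sanity check, not proof of how the data were processed.

```r
# Hypothetical helper (not part of GEOquery): guess whether expression
# values are already on a log2 scale by looking at their upper range.
looks_log2 <- function(x, upper = 20) {
  x <- as.numeric(x)          # also flattens a matrix of samples
  x <- x[is.finite(x)]        # ignore NA, NaN and Inf
  if (length(x) == 0) stop("no finite values to inspect")
  # Raw Affymetrix intensities run into the thousands, whereas
  # log2 values rarely exceed about 16, so a cutoff of 20 separates them.
  unname(quantile(x, 0.99)) <= upper
}

# The five VALUE entries posted for the first GSM
vals <- c(7.693888187, 8.571408272, 5.179812431, 7.468027592, 3.118550777)

looks_log2(vals)     # values as downloaded from GEO
looks_log2(2^vals)   # the same values back-transformed to raw intensities
range(log2(vals))    # what a second log2() does to the range
```

On a full matrix the same call works unchanged, e.g. looks_log2(data.matrix) before the log2() step.  If the check and the metadata disagree, that is a good reason to go back to the .CEL files.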

Scott   


Scott A. Ochsner, PhD
One Baylor Plaza BCM130, Houston, TX 77030
Voice: (713) 798-6227  Fax: (713) 790-1275

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Jing Huang
Sent: Tuesday, August 30, 2011 10:36 AM
To: 'bioconductor at r-project.org'
Subject: [BioC] GEOquery package

Dear Sean and all members,

I am trying to extract GSE data from GEO and analyze it. I am wondering whether the GSE data have already been normalized and log2-transformed. My R script and its output are copied below.  Can somebody help me with this?

> Table(GSMList(gse)[[1]])[1:5, ]
     ID_REF       VALUE
1 1007_s_at 7.693888187
2   1053_at 8.571408272
3    117_at 5.179812431
4    121_at 7.468027592
5 1255_g_at 3.118550777
> Columns(GSMList(gse)[[1]])[1:5, ]
     Column                Description
1    ID_REF
2     VALUE log2 signal intensity, RMA       <<<<< Does this mean that the values are log2-transformed and that the data were normalized by RMA?
NA     <NA>                       <NA>
NA.1   <NA>                       <NA>
NA.2   <NA>                       <NA>

According to the GEOquery package, I should do the following steps in order to get the eset:

> probesets <- Table(GPLList(gse)[[1]])$ID
> data.matrix <- do.call("cbind", lapply(GSMList(gse), function(x) {
+ tab <- Table(x)
+ mymatch <- match(probesets, tab$ID_REF)
+ return(tab$VALUE[mymatch])
+ }))
> data.matrix <- apply(data.matrix, 2, function(x) {
+ as.numeric(as.character(x))
+ })
> data.matrix <- log2(data.matrix)
> data.matrix[1:5, ]

     GSM424759 GSM424760 GSM424761 GSM424762 GSM424763 GSM424764 GSM424765
[1,]  2.943713  2.917086  2.926155  2.983485  2.973219  2.962445  2.926030
[2,]  3.099532  3.136898  3.152696  3.217172  3.206948  3.198448  3.135146
[3,]  2.372900  2.309177  2.354380  2.373350  2.368464  2.381139  2.314555
[4,]  2.900727  2.873853  2.863911  2.879232  2.927384  2.913594  2.852870
[5,]  1.640876  1.645330  1.494274  1.792643  1.719597  1.648126  1.605055

Is the log2 transformation necessary for this dataset?
Many thanks

Jing


_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


