[BioC] Annotation for a GEO data set

Sean Davis seandavi at gmail.com
Mon Mar 22 17:39:31 CET 2010


On Thu, Mar 4, 2010 at 12:30 PM, Joern Grame <gjormac at googlemail.com> wrote:
> Dear Bioconductor,
>
> I have a question concerning a GEO data set. I have downloaded the data into
> R using the GEOquery package. I'm trying to map the Affymetrix probe ids
> onto gene symbols, but can't find the appropriate annotation data. Following
> some of the tutorials, using the annotate package should help, but what I
> get from the function annotation is the GEO platform identifier:
>
>> library(GEOquery)
>> library(annotate)
>> data <- getGEO(GEO='GSE13639')[[1]]
>> annotation(data)
> [1] "GPL570"
>
> I'd like to use functions like getSYMBOL, but I don't know which mapping
> package to install.  Help will be much appreciated.

Hi, Joern.  GPL570 is represented in Bioconductor as hgu133plus2.db.
You can get this the old-fashioned way by looking up GPL570 in GEO and
then going to the Bioconductor website to find the right package by
hand.  Alternatively, you may use the GEOmetadb package to get the
information directly:

> library(GEOmetadb)
> sqlfile = getSQLiteFile()
> con = dbConnect(SQLite(),sqlfile)
> dbGetQuery(con,"select gpl,title,bioc_package from gpl where gpl='GPL570'")

Then, you are off to the races....

> BiocManager::install('hgu133plus2.db')

will get you the correct package.
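
Once the package is installed, a lookup along the following lines should work. This is only a minimal sketch: it assumes that both annotate and hgu133plus2.db are installed, and the three probe set IDs are picked purely for illustration.

```r
# Assumes the Bioconductor packages annotate and hgu133plus2.db are installed.
library(annotate)
library(hgu133plus2.db)

# A few illustrative probe set IDs from the GPL570 platform.
ids <- c("1007_s_at", "1053_at", "117_at")

# getSYMBOL returns a character vector named by probe ID,
# with NA for probes that have no symbol in the annotation package.
symbols <- getSYMBOL(ids, "hgu133plus2.db")
symbols
```

To annotate every probe on the array, featureNames(data) can be passed in place of ids.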

However, your "data" object already has the annotation information
from NCBI GEO in it:

> colnames(fData(data))
 [1] "ID"                               "GB_ACC"
 [3] "SPOT_ID"                          "Species.Scientific.Name"
 [5] "Annotation.Date"                  "Sequence.Type"
 [7] "Sequence.Source"                  "Target.Description"
 [9] "Representative.Public.ID"         "Gene.Title"
[11] "Gene.Symbol"                      "ENTREZ_GENE_ID"
[13] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
[15] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

> fData(data)$Gene.Symbol[1:10]
 [1] DDR1   RFC2   HSPA6  PAX8   GUCA1A UBA7   THRA   PTPN21 CCL5   CYP2E1
20828 Levels: ADAM32 AFG3L1 ALG10 ARMCX4 ATP6V1E2 BEST4 C15orf40 ... FAM86B1

> fData(data)["1007_s_at",]$Gene.Symbol
[1] DDR1
20828 Levels: ADAM32 AFG3L1 ALG10 ARMCX4 ATP6V1E2 BEST4 C15orf40 ... FAM86B1
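
Note that Gene.Symbol comes back as a factor, which is why the levels are printed. If you would rather have a plain character vector that can be indexed by probe ID, something like the sketch below may help. The data frame fd is a toy stand-in for fData(data), made up here so that the example runs on its own.

```r
# Toy stand-in for fData(data); the real table comes from GEO.
fd <- data.frame(
  ID = c("1007_s_at", "1053_at", "117_at"),
  Gene.Symbol = factor(c("DDR1", "RFC2", "HSPA6")),
  stringsAsFactors = FALSE
)
rownames(fd) <- fd$ID

# Convert the factor to character and name each entry by its probe ID,
# so symbols can be looked up directly by probe ID.
symbols <- setNames(as.character(fd$Gene.Symbol), rownames(fd))

# Look up two probes by name.
symbols[c("117_at", "1007_s_at")]
```

With the real object, replacing fd by fData(data) should give the same kind of named vector for the whole array.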

Hope that helps.

Sean


