[BioC] getGEO - getting the .CEL files from GEO

Thu Mar 18 12:49:56 CET 2010

On 17 March 2010 16:51, Sean Davis <seandavi at gmail.com> wrote:
> 2010/3/17 Vincent Carey <stvjc at channing.harvard.edu>:
>> do you really want to put sample-characteristics data in a CEL file?
>>
>> the sample characteristics are available as follows:
>>
>>  library(GEOquery)
>>  ff = getGEO("GSE4045")
>>
>> > table(pData(ff[[1]])$descr)
>>
>>        conventional colorectal tumor, mucinous, Dukes Stage c, MSS,
>> no cancer in the family, male, Distal Location , Tumor Grade 2
>>
>>                                                           1
>>  conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no
>> cancer in the family, female, Distal Location , Tumor Grade 2
>>
>>                                                           1
>> conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no
>> cancer in the family, female, Proximal Location , Tumor Grade 3
>>
>>                                                           1
>> ....
>>
>> and you will have to parse that 'description' field to extract stage
>> and other relevant information.  for example
>>
>> de = as.character(ff[[1]]$desc)
>> gr = gsub(".*, Tumor Grade.(.)$", "\\1", de)
>>
>> gives you a single character string for grade, except for sample 14 --
>> where my regexp doesn't do as much as it should.
>>
>> such activities would be used to populate an annotated data frame
>> which could then serve as the phenoData component of an AffyBatch
>> instance, which is a typical container for CEL-based intensity data,
>> to be propagated downstream through background correction and
>> normalization and so forth.  The experimentData element should also be
>> suitably populated, as early in the workflow as possible.  If we look
>> closely enough we can find that the ExpressionSet returned by getGEO
>> has quantifications generated by MAS 5.0.
>>
>> On Wed, Mar 17, 2010 at 11:27 AM, 張 語恬 <greengarden_0925 at hotmail.com> wrote:
>>>
>>>
>>> Hi:
>>>
>>> I've downloaded the GSE CEL files from GEO, but I am having trouble adding the individual characteristics, such as tumor site, age, gender and so on, to the CEL files.
>>>
>>> I've read the thread "[BioC] getGEO - getting the .CEL files from GEO", but I still don't understand it.
>>>
>>> Could you use GSE4045 as an example to demonstrate how to use exprs() (I found the instruction in the mailing list) to replace the data from GSE4045.SOFT with the raw CEL microarray data while keeping the sample characteristics?
>>>
>
> There are a couple of tricks here that can sometimes be useful to get
> better annotation.  In this case, they are not a big improvement.
>
> The GEO GSE data entity contains information as supplied by the
> submitters.  The GDS data entity contains data taken from GSE records
> that have been further curated by GEO staff.  Often, that leads to
> more useful annotation than comma-separated lists (although the
> information is usually the same or similar, at least).  To give an
> example of how one might learn of the existence of such a GDS given a
> GSE, one can use the GEOmetadb package:
>
> library(GEOmetadb)
> # Next command will take a minute....
> sqlfile = getSQLiteFile()
> # Check to see if the GSE record has a corresponding
> # GDS record
> geoConvert('GSE4045','gds')
>
> This series of commands will result in the following:
>
> $gds
>   from_acc  to_acc
> 1  GSE4045 GDS2201
>
> So, GSE4045 has been curated by NCBI GEO staff and the accession of
> the curated data is GDS2201.  We can check to see what subsets
> (phenotypic variables) are available using GEOmetadb, but we have to
> resort to writing SQL to do so:
>
> # make a connection to the database
> conn = dbConnect(SQLite(), sqlfile)
> dbGetQuery(conn,"select
> gds_subset.gds,gds_subset.description,gds_subset.type from gds_subset
> where gds='GDS2201'")
>
> The columnDescriptions() function returns a data.frame of tables,
> columns, and descriptions, which helps when writing SQL is necessary.
> The query above returns this small data.frame:
>
>       gds                       description          type
> 1 GDS2201     serrated colerectal carcinoma disease state
> 2 GDS2201 conventional colorectal carcinoma disease state
>
> So, unfortunately, the GEO staff has annotated only the two different
> types of colorectal carcinoma and not the other clinical variables.
> If this is all you wanted, then you can use getGEO('GDS2201') to get
> the annotations and attach those to the ExpressionSet that you create
> by normalizing the .CEL files of your choosing.  If not, then Vince's
> method is the way to go.
>
> Sean
>
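
For what it's worth, the parsing step Vince describes can be sketched in base R. The description strings below are copied from the table() output quoted above; the helper name parse_description and the column names are my own inventions, and the sketch assumes every sample lists the same eight comma-separated fields in the same order. Sample 14 apparently does not, so check the result before relying on it.

```r
# Three description strings copied from the quoted table() output.
de <- c(
  "conventional colorectal tumor, mucinous, Dukes Stage c, MSS, no cancer in the family, male, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage b, MSS, no cancer in the family, female, Distal Location , Tumor Grade 2",
  "conventional colorectal tumor, non-mucinous, Dukes Stage c, MSI, no cancer in the family, female, Proximal Location , Tumor Grade 3"
)

# Hypothetical helper: split each string on commas and trim whitespace.
# Strings that do not have exactly 8 fields become rows of NA, so that
# irregular samples are flagged instead of silently mis-parsed.
parse_description <- function(x) {
  cols <- c("type", "mucinous", "dukes", "ms_status",
            "family_history", "sex", "location", "grade")
  fields <- lapply(strsplit(x, ","), trimws)
  rows <- lapply(fields, function(f) {
    if (length(f) != length(cols)) f <- rep(NA_character_, length(cols))
    f
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- cols
  # Strip the constant labels, keeping only the informative part.
  out$dukes    <- sub("^Dukes Stage ", "", out$dukes)
  out$location <- sub(" Location$", "", out$location)
  out$grade    <- sub("^Tumor Grade ", "", out$grade)
  out
}

pd <- parse_description(de)
print(pd[, c("dukes", "ms_status", "sex", "location", "grade")])
```

In the real workflow, de would come from as.character(pData(ff[[1]])$description) rather than being typed in, and rows that come back as NA would need to be fixed by hand.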

It's also worth noting that ArrayExpress have imported much of the
data from common Affymetrix platforms (and some other platforms) from
GEO. These imported data sets have generally been put through a basic
curation step which does improve the computability of the annotation
somewhat. The general rule is that a GEO series GSENNNN has the
ArrayExpress accession E-GEOD-NNNN:

library(ArrayExpress)
abatch <- ArrayExpress('E-GEOD-4045')
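
If you have several series to look up, the naming rule is easy to apply programmatically. This is only a sketch: gse_to_ae is a made-up helper name, and it does not check that the ArrayExpress entry actually exists.

```r
# Hypothetical helper: map GEO series accessions (GSENNNN) to the
# ArrayExpress accessions (E-GEOD-NNNN) given by the naming rule.
gse_to_ae <- function(gse) {
  ok <- grepl("^GSE[0-9]+$", gse)
  if (!all(ok)) {
    stop("not a GEO series accession: ", paste(gse[!ok], collapse = ", "))
  }
  sub("^GSE", "E-GEOD-", gse)
}

print(gse_to_ae("GSE4045"))
```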

Not that it makes a huge difference in this case, but this is a pretty
good workaround when a GDS set is not available in GEO.
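
Whichever source the annotation comes from, its rows have to be in the same order as the CEL files before it can serve as phenoData. Here is a minimal base-R sketch, assuming the CEL files are named after their GSM accessions (as GEO supplementary files usually are). The file names, the pheno table and the helper align_pheno are all invented for illustration.

```r
# Hypothetical helper: reorder a phenotype data frame (row names = GSM
# accessions) to match a vector of CEL file names.
align_pheno <- function(cel_files, pheno) {
  # Drop the directory, then everything after the GSM accession,
  # e.g. "cel/GSM00001.CEL.gz" -> "GSM00001".
  gsm <- sub("^(GSM[0-9]+).*$", "\\1", basename(cel_files))
  idx <- match(gsm, rownames(pheno))
  if (anyNA(idx)) {
    stop("no annotation for: ", paste(cel_files[is.na(idx)], collapse = ", "))
  }
  out <- pheno[idx, , drop = FALSE]
  rownames(out) <- basename(cel_files)
  out
}

# Made-up example data, not real GSE4045 samples.
pheno <- data.frame(grade = c("1", "2", "3"),
                    row.names = c("GSM00001", "GSM00002", "GSM00003"),
                    stringsAsFactors = FALSE)
cel_files <- c("cel/GSM00003.CEL.gz", "cel/GSM00001.CEL.gz")

aligned <- align_pheno(cel_files, pheno)
print(aligned)
```

The aligned table should then be usable, wrapped in an AnnotatedDataFrame, as the phenoData argument of ReadAffy() from the affy package.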

Cheers,

Tim

-- 
(former AE curator)
Bioinformatician, Smith Lab
CIMR, University of Cambridge