[BioC] Normalization of array data from GEO repository

Tue Jul 14 11:58:21 CEST 2009

Hi,

Have a look at the ArrayExpress (AE) FAQ:
http://www.ebi.ac.uk/microarray/doc/help/faq.html#submitter_FAQ_general

*How much over-lap is there between ArrayExpress and the Gene Expression 
Omnibus (GEO)?*
We import data on a weekly basis from GEO (NCBI). As a priority all GEO 
experiments which are in GEO datasets on catalogue Affymetrix and 
Agilent platforms are imported and we re-curate these before loading 
into ArrayExpress. We also import all GSE on these platforms and these 
are loaded uncurated if they pass our quality checks (e.g. no corrupt 
data files). All experiments imported from GEO have accession numbers in 
the format of E-GEOD-n, where n is a number. For more information see
http://www.ebi.ac.uk/microarray/doc/help/GEO_data.html
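
As a rough illustration of that accession format, here is a minimal sketch 
in Python (plain string handling, no Bioconductor involved). It assumes 
that the n in E-GEOD-n is the number of the corresponding GEO series GSEn; 
the FAQ text above only says that n is a number, so treat that mapping as 
an assumption. The function names and the example accessions are 
hypothetical.

```python
import re

# Accessions of experiments imported from GEO: "E-GEOD-" followed by digits.
E_GEOD_PATTERN = re.compile(r"^E-GEOD-(\d+)$")
# GEO series identifiers: "GSE" followed by digits.
GSE_PATTERN = re.compile(r"^GSE(\d+)$")


def geo_series_from_ae(accession):
    """Return the assumed GEO series ID (GSEn) for an E-GEOD-n accession.

    Returns None when the accession does not follow the E-GEOD-n format,
    e.g. for experiments submitted directly to ArrayExpress.
    """
    match = E_GEOD_PATTERN.match(accession.strip())
    if match is None:
        return None
    return "GSE" + match.group(1)


def ae_accession_from_geo(series_id):
    """Return the assumed ArrayExpress accession (E-GEOD-n) for a GSEn ID.

    Returns None when the input is not of the form GSEn.
    """
    match = GSE_PATTERN.match(series_id.strip())
    if match is None:
        return None
    return "E-GEOD-" + match.group(1)


if __name__ == "__main__":
    # Hypothetical accessions, used only to show the format.
    for acc in ["E-GEOD-1234", "E-MEXP-100"]:
        print(acc, "->", geo_series_from_ae(acc))
```

A helper like this only checks the format of an accession. Whether a given 
series has actually been imported still has to be checked in ArrayExpress 
itself.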

I took a closer look at the "HG-U133A" chip type and found an overlap of 
more than 90%. In particular, all the newer experiments are available in 
AE as well. For analyses with R and Bioconductor, I found the file format 
in AE to be more suitable.

Best
Markus

James F. Reid wrote:
> Hi,
>
> Caveat: this is my understanding and I might be quite wrong.
>
> There is indeed no synchronization between the two databases for lack 
> of a common standard (each has its own flavour of MAGE-ML).
> In addition to investigators submitting to both repositories, 
> ArrayExpress also imports experiments from GEO according to certain 
> criteria. These are prefixed by 'E-GEOD' in the experiment ID. 
> Querying ArrayExpress for these returns 5155 such experiments out of a 
> total of 8372. GEO contains 12810 Series (experiments), so GEO does 
> contain more data I would say.
>
> HTH,
> James.
>
>
> Sean Davis wrote:
>> On Wed, Jul 8, 2009 at 6:16 AM, Joern Toedling
>> <Joern.Toedling at curie.fr> wrote:
>>
>>> Hello,
>>>
>>> just a small addendum: you may also want to have a look at the
>>> ArrayExpress package, which allows the user to retrieve data sets
>>> from the ArrayExpress database at EBI and returns the data in the
>>> form of an AffyBatch, NChannelSet, RGList or the like. Since GEO and
>>> ArrayExpress are regularly synchronized, you may be able to find
>>> your data sets of interest there as well.
>>>
>>
>> Actually, ArrayExpress and GEO are NOT synchronized. There are some
>> overlaps where investigators have submitted to both and for other
>> reasons, but GEO is still the larger of the two and they each contain
>> largely non-overlapping data sets.
>>
>>
>>> Regards,
>>> Joern
>>>
>>>
>>> On Tue, 7 Jul 2009 13:59:19 -0400, Steve Lianoglou wrote
>>>> Hi,
>>>>
>>>> On Jul 7, 2009, at 5:38 AM, Aleš Maver wrote:
>>>>
>>>>> Hi all,
>>>>> I have obtained several GEO Series (GSE) entries from the GEO
>>>>> repository using the getGEO function (GEOquery package).
>>>>> Data obtained in this manner are stored in the ExpressionSet class.
>>>>> The problem is that I don't know how to perform quality control
>>>>> analyses and normalization procedures on ExpressionSet data,
>>>>> because functions like expresso (affy package) work only on
>>>>> AffyBatch objects. Is there anything I am missing?
>>>> Sorry, I've never used the GEOquery package before, so I can't speak
>>>> much to that, but I'd be surprised if there isn't an option to
>>>> return your results as an AffyBatch object, because I'd dare say
>>>> that you can get most of the data from GEO in its raw format (e.g.,
>>>> CEL file or whatever).
>>>>
>>>>> And does anyone know whether data in the GEO repository is already
>>>>> normalised or not?
>>>> It depends. Sometimes you aren't given the raw files: sometimes the
>>>> data is from a custom array, or I've also seen some datasets
>>>> provided in post-processed form (already MAS5 normalized, for
>>>> example), but it's been my experience that you can get the raw data
>>>> for most of the experiments you find there.
>>>>
>>>> Also, for array quality assessment, look into the
>>>> arrayQualityMetrics package:
>>>>
>>>> http://www.bioconductor.org/packages/release/bioc/html/arrayQualityMetrics.html
>>>>
>>>> Hope that helps,
>>>> -steve
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>

-- 
Dipl.-Tech. Math. Markus Schmidberger

Ludwig-Maximilians-Universität München
IBE - Institut für medizinische Informationsverarbeitung,
Biometrie und Epidemiologie
Marchioninistr. 15, D-81377 Muenchen
URL: http://www.ibe.med.uni-muenchen.de 
Mail: Markus.Schmidberger [at] ibe.med.uni-muenchen.de
Tel: +49 (089) 7095 - 4497