[BioC] background correcting an eset from GEO query

Sean Davis seandavi at gmail.com
Wed Jan 6 20:41:28 CET 2010

On Wed, Jan 6, 2010 at 2:26 PM, jeremy wilson <jeremy.wilson88 at gmail.com> wrote:
> Hi Sean,
> thanks for your reply.
> I agree with you. It would be great if GEO makes submitting RAW data
> mandatory so the user can work on the data as he/she wishes.

Just to be clear, they now do.  That was not always the case, though.
The data set below is from 2004.

> I have a question regarding the dataset GSE1437 or GDS1512. It is a
> curated dataset, as NCBI claims. If we look at the individual
> histograms, I see they are log transformed and may be background
> corrected and normalized, but I am not sure, as I see long tails with
> expression values up to 2^250, which is very weird after bg correction.
> The first three, which are controls, are good, but the samples with
> factors give very high expression values.

Again, the description of the data handling is given in the GSM
records, but it isn't as complete as one might like.  If you really
want to normalize the way you want, you will probably need to get the
raw data from the original authors.
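
Where the CEL files can be had (for GSE1984, quoted further down, they
are on GEO as supplementary files), a minimal sketch of re-processing
might look like the following. It assumes the GEOquery and affy packages
are installed, that the download is named <accession>_RAW.tar, and that
the CEL file names start with the GSM accession. The names cel_to_gsm
and reprocess_from_cel are hypothetical helpers, and RMA is just one
choice of background correction and normalization, not necessarily what
the original authors used.

```r
# Sketch only: assumes GEOquery, affy and Biobase are installed and that the
# supplementary tar for the series really contains Affymetrix CEL files.

# Map CEL file names such as "GSM35348.CEL.gz" to GSM accessions.
# The naming pattern is an assumption; check it against your download.
cel_to_gsm <- function(cel_names) {
  sub("^(GSM[0-9]+).*$", "\\1", basename(cel_names))
}

# Hypothetical helper: re-process a series from its raw CEL files and carry
# over the sample annotation from the processed series matrix.
reprocess_from_cel <- function(acc = "GSE1984", dir = tempdir()) {
  library(GEOquery)
  library(affy)

  # Processed ExpressionSet, used here only for its phenoData.
  # Taking [[1]] assumes the series has a single platform.
  processed <- getGEO(acc, destdir = dir)[[1]]

  # Download the supplementary files into dir/<acc>/ and unpack the tar
  getGEOSuppFiles(acc, baseDir = dir)
  raw_dir <- file.path(dir, acc)
  untar(file.path(raw_dir, paste0(acc, "_RAW.tar")), exdir = raw_dir)
  cel_files <- list.files(raw_dir, pattern = "\\.cel(\\.gz)?$",
                          ignore.case = TRUE, full.names = TRUE)

  # rma() background corrects, quantile normalizes and summarizes
  eset <- rma(ReadAffy(filenames = cel_files))

  # Rename the new samples to GSM accessions, then align the old annotation
  sampleNames(eset) <- cel_to_gsm(sampleNames(eset))
  pData(eset) <- pData(processed)[sampleNames(eset), , drop = FALSE]
  eset
}
```

Nothing is downloaded until reprocess_from_cel() is actually called, and
the call needs network access. For a series without CEL files on GEO the
same steps would apply to files obtained from the authors.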

> library(GEOquery)
> gse <- getGEO("GSE1437")
> e <- exprs(gse$GSE1437_series_matrix.txt.gz)
> hist(e[,1], main="histogram of expression values",
>      xlab="log transformed expression values")
> ...
> hist(e[,9], main="histogram of expression values",
>      xlab="log transformed expression values")
> These high expression values, which I am not sure are real, mess up
> my analysis (the true differentially expressed genes are subdued and
> filtered out in the presence of these high expression values). I
> cannot remove all of them, as I do not know which ones are faulty and
> which ones are real. Do you think these values are not weird?
> The link http://www.ncbi.nlm.nih.gov/projects/geo/gds/profileGraph.cgi?gds=1512
>  shows a box plot of expression values for all 9 samples which are in
> the limits of 0 - 3 in log scale which contradicts the expression
> values I see from the histogram (~ 0-250) .
> I wanted to make sure before contacting the author.

I would say that it is really up to you, but there are always going to
be some questions about what was done and how adequate it is when faced
with processed data (unless one uses Bioconductor tools and produces
something that is meant to be reproducible : )).
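
One quick way to see what scale a processed matrix is on, before
writing to the author, is to tabulate the range of each sample. The
following is only a sketch in base R. The name describe_scale is a
hypothetical helper, and the cutoff of 20 is a rule-of-thumb assumption
(log2 Affymetrix intensities rarely go much above 16), not anything GEO
defines.

```r
# Per-sample summary of an expression matrix (rows = probes, columns = samples).
# The cutoff used for looks_logged is a rule of thumb, not a GEO standard.
describe_scale <- function(e, log_cutoff = 20) {
  stats <- t(apply(e, 2, function(x) {
    c(min    = min(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      q99    = unname(quantile(x, 0.99, na.rm = TRUE)),
      max    = max(x, na.rm = TRUE))
  }))
  stats <- as.data.frame(stats)
  # A maximum far above the cutoff suggests the column is not on a log2 scale
  stats$looks_logged <- stats$max <= log_cutoff
  stats
}

# Usage with the objects from this thread (needs GEOquery and network access):
# e <- exprs(getGEO("GSE1437")[[1]])
# describe_scale(e)
```

Samples whose maximum sits far from the others stand out in the
resulting table, which gives something concrete to ask the author about.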


> Awaiting your advice...
> Thank you
> On Tue, Jan 5, 2010 at 5:41 PM, Sean Davis <seandavi at gmail.com> wrote:
>> On Tue, Jan 5, 2010 at 7:24 PM, jeremy wilson <jeremy.wilson88 at gmail.com> wrote:
>>> Dear BioConductors,
>>> I am using GEOquery to get valuable datasets from the GEO database for my
>>> analysis. Most of the datasets I require do not have raw data submitted,
>>> and I have to rely on the SOFT files to get the expression set
>>> directly using the "getGEO" command. When I plot the intensities of
>>> the expression values, I see that none of those I tried are
>>> preprocessed (background corrected, normalized), as I see negative
>>> expression values and data that are not normalized or log transformed (my
>>> apologies if I am wrong).
>>> For example:
>>> library(GEOquery)
>>> gse <- getGEO("GSE1984")
>>> e <- exprs(gse$GSE1984_series_matrix.txt.gz)
>>> hist(e[,1], main="histogram of expression values",
>>>      xlab="Untransformed expression values")
>>> elog <- log2(e)  # negative values become NaN here
>>> hist(elog[,1], main="histogram of expression values",
>>>      xlab="log transformed expression values")
>>> esetOrig <- gse$GSE1984_series_matrix.txt.gz
>>> hist(esetOrig)
>>> We can clearly see from the histograms that the arrays are not
>>> normalized. The same is true for the GSE4465 dataset, among others.
>> The data are described on the GEO website.  For example, see:
>> http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSM35348
>> Note that the VALUE column is described as "Affymetrix Signal".  In
>> this particular case, you would probably still need to contact the
>> original investigator to know exactly what this means, but you may be
>> correct that this may not represent an adequately normalized value.
>> However, the raw data ARE available for GSE1984.  You can get them like so:
>> getGEOSuppFiles('GSE1984')
>> This will download a tar file full of .CEL files.  You can use the
>> normal Bioconductor affy tools to work with the data and then
>> transplant the phenodata from the ExpressionSet you created above to
>> the resulting new ExpressionSet.
>>> I am assuming the data we get in the expression set from getGEO for
>>> datasets like these are hence just the RAW intensity values, summarized
>>> at the probeset level somehow but not bg corrected and normalized between
>>> arrays. I would hence like to do these steps one by one on the eset. I
>>> searched the web for packages that do bg correction and normalization on
>>> an eset. I did find normalize.ExpressionSet but could not find a bg
>>> correction method for an eset. I think it may not be possible to do a bg
>>> correction on an eset, as it holds no spatial positional information for
>>> probes or probesets, unlike an AffyBatch object.
>>> In case there is no bg correction method for an eset, please suggest
>>> how to proceed from an eset from GEOquery to a bg corrected,
>>> normalized eset.
>> Unfortunately, there is no standard for GSE records, except that the
>> values are _supposed_ to be normalized in some fashion by the
>> investigators.  In most cases, they are, but that may not mean that
>> they would be normalized the same way if done by another person.  You
>> can either use the values in the GSE record (not the GSEMatrix), if
>> those values allow you to renormalize, or you will need to download
>> the raw data.  If neither is available, then you are stuck writing to
>> the authors and hoping for the best.  As a note, GDS records are truly
>> normalized (GEO checks this), so those are generally a good bet if a
>> GDS is available.
>> Hope that helps.
>> Sean
>>> I would greatly appreciate your help. Thank you
>>> sessionInfo()
>>> R version 2.10.0 (2009-10-26)
>>> i386-pc-mingw32
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
>>> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_United States.1252
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices datasets  utils     methods   base
>>> other attached packages:
>>>  [1] convert_1.21.1       marray_1.23.0        geneplotter_1.23.3   lattice_0.17-26
>>>  [5] annotate_1.23.4      AnnotationDbi_1.7.20 genefilter_1.26.4    affyPLM_1.22.0
>>>  [9] preprocessCore_1.7.9 gcrma_2.17.4         affy_1.23.12         GEOquery_2.11.2
>>> [13] RCurl_1.2-1          bitops_1.0-4.1       limma_3.0.3          Biobase_2.5.8
>>> loaded via a namespace (and not attached):
>>>  [1] affyio_1.13.5      Biostrings_2.13.54 DBI_0.2-4          grid_2.10.0        IRanges_1.3.99
>>>  [6] RColorBrewer_1.0-2 RSQLite_0.7-3      splines_2.10.0     survival_2.35-7    xtable_1.5-5
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
